
[ENH] Simplified Publish API with Automatic Type Recognition#1554

Open
Omswastik-11 wants to merge 26 commits into openml:main from Omswastik-11:prototype-publish

Conversation

@Omswastik-11
Contributor

@Omswastik-11 Omswastik-11 commented Dec 24, 2025

Initially, publishing a model required manual conversion to a flow:

from openml_sklearn.extension import SklearnExtension
from sklearn.neighbors import KNeighborsClassifier
clf = KNeighborsClassifier(n_neighbors=3)
extension = SklearnExtension()  # User instantiates the extension object
knn_flow = extension.model_to_flow(clf)  # User manually converts the model (estimator instance) to an OpenMLFlow object
knn_flow.publish()

Proposed API:

from sklearn.neighbors import KNeighborsClassifier
import openml_sklearn  # Register the extension
import openml

clf = KNeighborsClassifier(n_neighbors=3)

openml.publish(clf)
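One way to picture the automatic type recognition behind openml.publish is a small dispatch registry: objects that already know how to publish themselves are used directly, and anything else is routed through whichever registered extension can convert it. The sketch below is illustrative only; the names register_extension and _EXTENSIONS are assumptions, not the actual code in this PR.

```python
_EXTENSIONS = []  # (can_handle, to_flow) pairs registered by extension packages


def register_extension(can_handle, to_flow):
    """Called by an extension package (e.g. openml-sklearn) at import time."""
    _EXTENSIONS.append((can_handle, to_flow))


def publish(obj, **kwargs):
    """Publish any supported object, converting raw models to flows first."""
    # OpenML entities (datasets, tasks, flows, runs) already expose .publish().
    if callable(getattr(obj, "publish", None)):
        return obj.publish()
    # Otherwise look for a registered extension that can convert the object.
    for can_handle, to_flow in _EXTENSIONS:
        if can_handle(obj):
            return to_flow(obj, **kwargs).publish()
    raise ValueError(f"no registered extension can publish {type(obj).__name__}")
```

Under this model, importing openml_sklearn would register its checker/converter pair, which is why the usage example needs the import even though publish is called on openml itself.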

@Omswastik-11 Omswastik-11 changed the title [ENH] improve publish api for users [ENH] Simplified Publish API with Automatic Type Recognition Dec 24, 2025
Collaborator

@fkiraly fkiraly left a comment


I get this is still a draft; some early comments:

  • This works for flows only. I would recommend trying at least two different object types to see the dispatching challenge there.
  • Do the extension checking inside publish, not in the usage example.

@Omswastik-11
Contributor Author

Omswastik-11 commented Dec 25, 2025

Thanks @fkiraly!
I checked flow, dataset, and task; those work correctly, but run is hitting some server-side issues:

Task 1 failed: https://test.openml.org/api/v1/xml/data/features/1 returned code 274: No features found. Additionally, dataset processed with error - None

@Omswastik-11 Omswastik-11 requested a review from fkiraly December 25, 2025 08:07
@jgyasu
Contributor

jgyasu commented Dec 31, 2025

The PR description is not entirely correct. This is how the interface looks currently:

from openml_sklearn.extension import SklearnExtension
from sklearn.neighbors import KNeighborsClassifier
clf = KNeighborsClassifier(n_neighbors=3)
extension = SklearnExtension()  # User instantiates the extension object
knn_flow = extension.model_to_flow(clf)  # User manually converts the model (estimator instance) to an OpenMLFlow object
knn_flow.publish()

But I like the idea of a unified publish. I am currently working on a design document for refactoring Extension, and this design coincides with mine as well, which is a good thing.

@Omswastik-11
Contributor Author

The PR description is not entirely correct. This is how the interface looks currently:

from openml_sklearn.extension import SklearnExtension
from sklearn.neighbors import KNeighborsClassifier
clf = KNeighborsClassifier(n_neighbors=3)
extension = SklearnExtension()  # User instantiates the extension object
knn_flow = extension.model_to_flow(clf)  # User manually converts the model (estimator instance) to an OpenMLFlow object
knn_flow.publish()

But I like the idea of a unified publish. I am currently working on a design document for refactoring Extension, and this design coincides with mine as well, which is a good thing.

Thanks for the correction! I used the syntax from the example script. The unified publish was Franz's idea: https://github.com/gc-os-ai/openml-project-dev/issues/8

@Omswastik-11 Omswastik-11 marked this pull request as ready for review January 1, 2026 11:07
@codecov-commenter

codecov-commenter commented Jan 6, 2026

Codecov Report

❌ Patch coverage is 36.84211% with 12 lines in your changes missing coverage. Please review.
✅ Project coverage is 52.63%. Comparing base (25ba6f8) to head (605daad).

Files with missing lines Patch % Lines
openml/publishing.py 33.33% 12 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1554      +/-   ##
==========================================
- Coverage   52.70%   52.63%   -0.07%     
==========================================
  Files          37       38       +1     
  Lines        4385     4404      +19     
==========================================
+ Hits         2311     2318       +7     
- Misses       2074     2086      +12     

☔ View full report in Codecov by Sentry.

@jgyasu
Contributor

jgyasu commented Jan 13, 2026

I have added some comments. I also feel we should not populate __init__.py with these functions; we can keep them in a separate file and use __init__.py only for imports.

@Omswastik-11 Omswastik-11 requested a review from jgyasu January 13, 2026 15:00
Copilot AI review requested due to automatic review settings February 26, 2026 10:49

Copilot AI left a comment


Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.



Copilot AI review requested due to automatic review settings February 27, 2026 09:50

Copilot AI left a comment


Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.



Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings March 2, 2026 07:02

Copilot AI left a comment


Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.



Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings March 2, 2026 07:07

Copilot AI left a comment


Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.



Copilot AI review requested due to automatic review settings March 2, 2026 07:38

Copilot AI left a comment


Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.

Comments suppressed due to low confidence (1)

openml/__init__.py:33

  • openml.publish(...) is used by the new tests/examples, but the package top-level does not expose a publish attribute (only the publishing submodule is imported). This will raise AttributeError for openml.publish(...). Re-export the function from openml.publishing (and add it to __all__), or update the call sites to use openml.publishing.publish(...) consistently.
from . import (
    _api_calls,
    config,
    datasets,
    evaluations,
    exceptions,
    extensions,
    flows,
    publishing,
    runs,
    setups,
    study,
    tasks,


Copilot AI review requested due to automatic review settings March 11, 2026 09:30
Co-authored-by: Armaghan Shakir <raoarmaghanshakir040@gmail.com>

Copilot AI left a comment


Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.



Comment on lines 55 to +62
# ## Upload the machine learning experiments to OpenML
# First, create a flow and fill it with metadata about the machine learning model.
#
# ### Option A: Automatic publishing (simplified)
# The publish function automatically detects the model type and creates the flow:

# %%
knn_flow = openml.flows.OpenMLFlow(
    # Metadata
    model=clf,  # or None, if you do not want to upload the model object.
    name="CustomKNeighborsClassifier",
    description="A custom KNeighborsClassifier flow for OpenML.",
    external_version=f"{sklearn.__version__}",
    language="English",
    tags=["openml_tutorial_knn"],
    dependencies=f"{sklearn.__version__}",
    # Hyperparameters
    parameters={k: str(v) for k, v in knn_parameters.items()},
    parameters_meta_info={
        "n_neighbors": {"description": "number of neighbors to use", "data_type": "int"}
    },
    # If you have a pipeline with subcomponents, such as preprocessing, add them here.
    components={},
)
knn_flow.publish()
print(f"knn_flow was published with the ID {knn_flow.flow_id}")
knn_flow = openml.publish(clf, tags=["openml_tutorial_knn"])
print(f"Flow was auto-published with ID {knn_flow.flow_id}")

Copilot AI Mar 11, 2026


This tutorial now uses openml.publish(clf, ...) which requires an installed/registered scikit-learn extension (typically openml-sklearn). Since the script doesn’t import openml_sklearn or mention the dependency, users running the example without that extra will get a ValueError. Consider adding a short note (or an explicit import openml_sklearn # noqa: F401) near the top so the example is self-contained and the requirement is clear.

Comment on lines +85 to +91
    if tags and hasattr(obj, "tags"):
        existing = obj.tags or []
        if all(isinstance(tag, str) for tag in existing):
            obj.tags = list(dict.fromkeys([*existing, *tags]))
    if name is not None and hasattr(obj, "name"):
        obj.name = name
    return obj.publish()

Copilot AI Mar 11, 2026


tags is typed as Sequence[str], but at runtime passing a single string (e.g., tags="foo") will be treated as an iterable of characters and will silently add "f", "o", "o". It would be safer to validate that tags is not a str (and that all provided tags are strings) and raise a clear TypeError/ValueError when the input is invalid.
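The pitfall Copilot describes could be addressed with an explicit guard before merging tags. The helper below is a hypothetical sketch, not code from this PR; normalize_tags is an assumed name.

```python
def normalize_tags(tags):
    """Validate and de-duplicate tags, rejecting a bare string early."""
    # A str is itself a Sequence[str], so tags="foo" would silently
    # iterate as "f", "o", "o" without this check.
    if isinstance(tags, str):
        raise TypeError("tags must be a sequence of strings, not a single string")
    tags = list(tags)
    if not all(isinstance(t, str) for t in tags):
        raise TypeError("every tag must be a string")
    return list(dict.fromkeys(tags))  # keeps first-seen order, drops duplicates
```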

Collaborator

@PGijsbers PGijsbers left a comment


I have to give this some more consideration. Here are my thoughts so far.

Currently, all OpenML entities that are publishable have a publish() method. In that sense, the API is already unified. In my view, the benefit of this is therefore mainly that it can extend to new object types, such as estimators for which an extension is registered that can serialize them to a Flow. So it seems there are now two alternatives:

dataset = OpenMLDataset(name="foo", tags=["bar"], ...)
dataset.publish()

or

dataset = OpenMLDataset(...)
openml.publish(dataset, name="foo", tags=["bar"])
# of course, name and tags could also have been provided during initialisation of the OpenMLDataset

So for anything that is already an OpenML object, I do not really see the benefit. It just introduces two different ways to do things, which I would generally be against. (I assume here the intention is for the publish method to remain on the OpenML objects as well.)

For estimators that are to be converted to flows, this is of course significantly shorter as shown in the original proposal. However, I think it would also be worth considering an alternative. Consider that perhaps instead an OpenMLFlow could be initialised with an arbitrary object which would be attempted to be resolved with extensions. Then it could also provide a similarly smooth experience:

estimator = sklearn.tree.DecisionTreeClassifier()
flow = OpenMLFlow(estimator, name="foo", tags=["bar"])
flow.publish()

Sure, it introduces an extra line of code, but it does make it explicit to the user what kind of OpenML object they are publishing.
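The alternative sketched above (a flow constructor that resolves arbitrary objects through registered extensions, while keeping creation decoupled from the network call) could look roughly like the toy below. All names here are illustrative assumptions, not the openml-python API.

```python
_CONVERTERS = []  # (can_handle, serialize) pairs; illustrative registry only


def register_converter(can_handle, serialize):
    _CONVERTERS.append((can_handle, serialize))


class Flow:
    """Toy flow whose constructor resolves an arbitrary model via extensions."""

    def __init__(self, model, name, tags=()):
        # Resolution happens at construction time, so the user sees explicitly
        # what kind of OpenML object they are creating.
        for can_handle, serialize in _CONVERTERS:
            if can_handle(model):
                self.serialized = serialize(model)
                break
        else:
            raise ValueError("no registered extension can serialize this model")
        self.name = name
        self.tags = list(tags)
        self.flow_id = None

    def publish(self):
        # The network upload would happen here; creation itself stays
        # offline-friendly, which matters for the cache-prepopulation setups
        # mentioned below.
        self.flow_id = 1  # placeholder for a server-assigned id
        return self
```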

I am trying to think of other categories of objects that would be parsed into OpenML objects and would also benefit from this general publish function, but I can't really think of any. In most cases, I think it would be far more useful to have one dedicated function to create the object and thus communicate the metadata schema. E.g., a dataset can have a name, description, author, and so on, and I do not think this translates well into a general publish call (thinking of e.g. openml.publish(dataframe, name="...", description="...", ...), where the type hints cannot provide information as to what metadata is valid).

In any case, the coupling of object creation with publication to the platform is problematic in the case where users do not have an internet connection. This can be the case where e.g., a user prepopulates their cache when they have a connection and then executes experiments in an offline setting (most commonly some compute server setups, but potentially also something like a regular outage). While we do provide some utility functions that do this (like run_model_on_task, though there too we made sure it could run offline with the right arguments), I am hesitant to introduce that as the main way to create/share objects.
