[ENH] Simplified Publish API with Automatic Type Recognition#1554
Omswastik-11 wants to merge 26 commits into openml:main
Conversation
fkiraly
left a comment
I get this is a draft still; some early comments:
- it works for flows only. I would recommend trying at least two different object types to see the dispatching challenge there.
- do the extension checking inside `publish`, not in the usage example.
Thanks @fkiraly !!
The PR description is not entirely correct. This is how the interface looks currently:

```python
from openml_sklearn.extension import SklearnExtension
from sklearn.neighbors import KNeighborsClassifier

clf = KNeighborsClassifier(n_neighbors=3)
extension = SklearnExtension()  # user instantiates the extension object
knn_flow = extension.model_to_flow(clf)  # user manually converts the model (estimator instance) to an OpenMLFlow object
knn_flow.publish()
```

But I like the idea of a unified `publish`.
Thanks for the correction! I used the syntax from the example script. The unified publish was Franz's idea: https://github.com/gc-os-ai/openml-project-dev/issues/8
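To make the discussion concrete, here is a hedged sketch of what such a unified publish could look like. None of these names (`_converters`, `register_converter`, the predicate mechanism) are the real openml API; they are assumptions made purely for illustration of the dispatching idea:

```python
# Sketch: a single publish() entry point that either passes OpenML objects
# through or converts raw models via converters registered by extensions.
_converters = []  # list of (predicate, to_openml_object) pairs

def register_converter(predicate, to_openml):
    """Register a converter that turns a raw model into a publishable object."""
    _converters.append((predicate, to_openml))

def publish(obj, **metadata):
    """Publish an OpenML object directly, or convert a raw model first."""
    if hasattr(obj, "publish"):
        # Already an OpenML entity (flow, dataset, run, ...): publish as-is.
        return obj.publish()
    for predicate, to_openml in _converters:
        if predicate(obj):
            # A registered extension claims this object type; convert, then publish.
            return to_openml(obj, **metadata).publish()
    raise ValueError(f"No registered extension can publish {type(obj).__name__}")
```

An extension such as openml-sklearn would call `register_converter` at import time, which is why the examples below depend on the extension package being imported.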
Codecov Report

❌ Patch coverage is

Additional details and impacted files

```
@@            Coverage Diff             @@
##             main    #1554      +/-   ##
==========================================
- Coverage   52.70%   52.63%   -0.07%
==========================================
  Files          37       38       +1
  Lines        4385     4404      +19
==========================================
+ Hits         2311     2318       +7
- Misses       2074     2086      +12
```

View full report in Codecov by Sentry.
I have added some comments. I also feel we should not populate
Pull request overview
Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
for more information, see https://pre-commit.ci
Pull request overview
Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Pull request overview
Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.
Pull request overview
Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.
Comments suppressed due to low confidence (1)
`openml/__init__.py:33`

`openml.publish(...)` is used by the new tests/examples, but the package top level does not expose a `publish` attribute (only the `publishing` submodule is imported). This will raise `AttributeError` for `openml.publish(...)`. Re-export the function from `openml.publishing` (and add it to `__all__`), or update the call sites to use `openml.publishing.publish(...)` consistently.
```python
from . import (
    _api_calls,
    config,
    datasets,
    evaluations,
    exceptions,
    extensions,
    flows,
    publishing,
    runs,
    setups,
    study,
    tasks,
```
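One way to implement the re-export Copilot suggests, sketched under the assumption that the `publish` function lives in `openml/publishing.py` (exact names and module layout unverified):

```python
# openml/__init__.py (sketch, not the actual file)
from . import publishing
from .publishing import publish  # re-export so `openml.publish(...)` resolves

__all__ = ["publishing", "publish"]
```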
Co-authored-by: Armaghan Shakir <raoarmaghanshakir040@gmail.com>
Pull request overview
Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.
```python
# ## Upload the machine learning experiments to OpenML
# First, create a flow and fill it with metadata about the machine learning model.
#
# ### Option A: Automatic publishing (simplified)
# The publish function automatically detects the model type and creates the flow:

# %%
knn_flow = openml.flows.OpenMLFlow(
    # Metadata
    model=clf,  # or None, if you do not want to upload the model object.
    name="CustomKNeighborsClassifier",
    description="A custom KNeighborsClassifier flow for OpenML.",
    external_version=f"{sklearn.__version__}",
    language="English",
    tags=["openml_tutorial_knn"],
    dependencies=f"{sklearn.__version__}",
    # Hyperparameters
    parameters={k: str(v) for k, v in knn_parameters.items()},
    parameters_meta_info={
        "n_neighbors": {"description": "number of neighbors to use", "data_type": "int"}
    },
    # If you have a pipeline with subcomponents, such as preprocessing, add them here.
    components={},
)
knn_flow.publish()
print(f"knn_flow was published with the ID {knn_flow.flow_id}")
knn_flow = openml.publish(clf, tags=["openml_tutorial_knn"])
print(f"Flow was auto-published with ID {knn_flow.flow_id}")
```
This tutorial now uses `openml.publish(clf, ...)`, which requires an installed/registered scikit-learn extension (typically `openml-sklearn`). Since the script doesn't import `openml_sklearn` or mention the dependency, users running the example without that extra will get a `ValueError`. Consider adding a short note (or an explicit `import openml_sklearn  # noqa: F401`) near the top so the example is self-contained and the requirement is clear.
openml/publishing.py
```python
if tags and hasattr(obj, "tags"):
    existing = obj.tags or []
    if all(isinstance(tag, str) for tag in existing):
        obj.tags = list(dict.fromkeys([*existing, *tags]))
if name is not None and hasattr(obj, "name"):
    obj.name = name
return obj.publish()
```
`tags` is typed as `Sequence[str]`, but at runtime passing a single string (e.g., `tags="foo"`) will be treated as an iterable of characters and will silently add `"f"`, `"o"`, `"o"`. It would be safer to validate that `tags` is not a `str` (and that all provided tags are strings) and raise a clear `TypeError`/`ValueError` when the input is invalid.
PGijsbers
left a comment
I have to give this some more consideration. Here are my thoughts so far.
Currently, all OpenML entities that are publishable have a `publish()` method. In that sense, the API is already unified. In my view, the benefit of this is therefore mainly that it can extend to new object types, such as estimators for which an extension is registered that can serialize them to a Flow. So it seems there are now two alternatives:
```python
dataset = OpenMLDataset(name="foo", tags=["bar"], ...)
dataset.publish()
```

or

```python
dataset = OpenMLDataset(...)
openml.publish(dataset, name="foo", tags=["bar"])
# of course, name and tags could also have been provided during initialisation of the OpenMLDataset
```

So for anything that is already an OpenML object, I do not really see the benefit. It just introduces two different ways to do things, which I would generally be against. (I assume here the intention is for the publish method to remain on the OpenML objects as well.)
For estimators that are to be converted to flows, this is of course significantly shorter, as shown in the original proposal. However, I think it would also be worth considering an alternative. Consider that perhaps instead an OpenMLFlow could be initialised with an arbitrary object, which would be attempted to be resolved with extensions. Then it could also provide a similarly smooth experience:

```python
estimator = sklearn.tree.DecisionTreeClassifier()
flow = OpenMLFlow(estimator, name="foo", tags=["bar"])
flow.publish()
```

Sure, it introduces an extra line of code, but it does make it explicit to the user what kind of OpenML object they are publishing.
I am trying to think of other categories of objects that would be parsed into OpenML objects and would also benefit from this general publish function, but I can't really think of any. In most cases, I think it would be far more useful to have one dedicated function to create the object and thus communicate the metadata schema. E.g., a dataset can have a name, description, author, and so on, and I do not think this is something that would translate well into a general publish call (thinking of e.g., `openml.publish(dataframe, name="...", description="...", ...)`, where the type hints cannot provide information as to what metadata is valid).
In any case, the coupling of object creation with publication to the platform is problematic in the case where users do not have an internet connection. This can be the case where e.g., a user prepopulates their cache when they have a connection and then executes experiments in an offline setting (most commonly some compute server setups, but potentially also something like a regular outage). While we do provide some utility functions that do this (like run_model_on_task, though there too we made sure it could run offline with the right arguments), I am hesitant to introduce that as the main way to create/share objects.