Neptune API error handling in callbacks
The Run object exposes four parameters that serve as callbacks for various error or warning scenarios:
on_network_error_callback
Handles low-level network errors that occur during HTTP requests. These errors include:
- Read, write, and connect timeouts
- Malformed requests
- Connection failures
Note: This callback is called only when the retry mechanism fails.
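A minimal sketch of a handler for this callback, assuming the `(exc, ts)` signature used in the examples later on this page. The `NetworkErrorTracker` class and its attributes are hypothetical names introduced for illustration, and `ts` is assumed to be a timestamp that may be `None`.

```python
import logging
from typing import Optional

logger = logging.getLogger("neptune_callbacks")


class NetworkErrorTracker:
    """Hypothetical helper: counts network errors reported to the callback."""

    def __init__(self) -> None:
        self.count = 0
        self.last_error: Optional[BaseException] = None

    def __call__(self, exc: BaseException, ts: Optional[float]) -> None:
        # By the time this runs, the retry mechanism has already failed,
        # so the error is recorded rather than retried again here.
        self.count += 1
        self.last_error = exc
        logger.warning("Network error #%d (ts=%s): %r", self.count, ts, exc)


tracker = NetworkErrorTracker()
# Assumed wiring: pass `tracker` as the on_network_error_callback parameter.
```

Keeping the count on an object, rather than only logging, lets the training code inspect `tracker.count` and decide for itself whether to continue.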
on_warning_callback
Called in a few specific scenarios:
- You're creating a run with an ID that already exists.
- You're trying to fork a run that doesn't exist.
- You're sending a metric point that is identical to the latest point already in that metric.
on_error_callback
Umbrella callback for various issues. Includes a mix of error classes:
- API authorization errors. For example, you don't have permissions to write to a project.
- Errors in the lifecycle of the local process synchronizing data to Neptune. For example, the process exited unexpectedly.
- Semantic errors, such as:
  - You're trying to write to a run that doesn't exist.
  - You're trying to fork a run, but its parent doesn't exist.
  - You're trying to create a run, but the creation parameters are invalid.
  - You're trying to write a point to a metric with non-increasing step or timestamp.
on_queue_full_callback (unused)
This parameter is currently unused.
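To show how the four parameters fit together, the sketch below builds one logging handler per parameter name, so each log line shows which callback fired. The helper names (`make_labelled_callback`, `build_callback_kwargs`) are hypothetical, and the `(exc, ts)` signature is assumed from the examples further down this page.

```python
import logging
from typing import Callable, Dict, Optional

logger = logging.getLogger("neptune_callbacks")

Callback = Callable[[BaseException, Optional[float]], None]

# Parameter names as listed in this section.
CALLBACK_PARAMS = (
    "on_network_error_callback",
    "on_warning_callback",
    "on_error_callback",
    "on_queue_full_callback",
)


def make_labelled_callback(label: str) -> Callback:
    """Return a handler that logs the exception together with its source."""

    def _callback(exc: BaseException, ts: Optional[float]) -> None:
        logger.warning("[%s] %s: %s", label, type(exc).__name__, exc)

    return _callback


def build_callback_kwargs() -> Dict[str, Callback]:
    """Map every callback parameter name to its own labelled handler."""
    return {name: make_labelled_callback(name) for name in CALLBACK_PARAMS}


callback_kwargs = build_callback_kwargs()
# Assumed wiring: unpack as keyword arguments when creating the Run,
# for example Run(..., **callback_kwargs).
```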
Override the default error handling
We recommend always providing error handling callbacks explicitly. There are two main directions to optimize for:
- A) Never stop the training process, even at the cost of some data not appearing in Neptune.
  In this case, we recommend setting all callbacks to something like:

      def _my_callback(exc: BaseException, ts: Optional[float]) -> None:
          logger.warning(f"Encountered {exc} error")

- B) Ensure correct and complete data, even at the cost of stopping the training process.
  This scenario is more complex and requires handling exceptions on a case-by-case basis.
Example
    def _my_callback(exc: BaseException, ts: Optional[float]) -> None:
        if isinstance(exc, NeptuneSynchronizationStopped):
            # The process synchronizing logged data to Neptune exited
            run.terminate()
        elif isinstance(exc, NeptuneFloatValueNanInfUnsupported):
            # We're trying to log NaN/Inf, which is currently not supported
            logger.warning("Failed to log NaN/Inf metric value")
        # ... (handle further exception types case by case)
        else:
            # Anything unrecognized: stop rather than risk incomplete data
            run.terminate()
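The same dispatch pattern can be tried without the Neptune client installed. In this sketch, `SyncStopped`, `NanInfUnsupported` and `FakeRun` are stand-ins invented for illustration (they are not Neptune classes), and `make_strict_callback` is a hypothetical factory that binds the handler to a run instead of relying on a global `run` variable.

```python
import logging
from typing import Callable, Optional

logger = logging.getLogger("neptune_callbacks")


class SyncStopped(Exception):
    """Stand-in for NeptuneSynchronizationStopped."""


class NanInfUnsupported(Exception):
    """Stand-in for NeptuneFloatValueNanInfUnsupported."""


class FakeRun:
    """Stand-in for a Run; only records whether terminate() was called."""

    def __init__(self) -> None:
        self.terminated = False

    def terminate(self) -> None:
        self.terminated = True


def make_strict_callback(
    run: FakeRun,
) -> Callable[[BaseException, Optional[float]], None]:
    """Build a handler that stops the given run on anything it cannot tolerate."""

    def _callback(exc: BaseException, ts: Optional[float]) -> None:
        if isinstance(exc, SyncStopped):
            # Synchronization process exited: data would be lost, so stop.
            run.terminate()
        elif isinstance(exc, NanInfUnsupported):
            # Tolerated case: log it and keep training.
            logger.warning("Failed to log NaN/Inf metric value")
        else:
            # Unrecognized error: stop rather than risk incomplete data.
            run.terminate()

    return _callback
```

With the real client, the stand-in exception classes would be replaced by the Neptune exceptions named in the example above, and the returned handler would be passed as one of the callback parameters.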
For the full set of possible exceptions, see the source code on GitHub.