You can view the usage status of all AI EasyMaker resources in the dashboard.
The number of resources in use is displayed for each resource type.
Create and manage Jupyter notebook with essential packages installed for machine learning development.
Create a Jupyter notebook.
Image: Select the OS image to be installed on the notebook instance.
Notebook Information
Storage: Set the data storage to be mounted at the /root/easymaker directory path. Data on this storage is retained even when the notebook is restarted. To connect NHN Cloud NAS, enter the path in the format nas://{NAS ID}:/{path}.
Additional Settings
[Caution] When using NHN Cloud NAS: Only NHN Cloud NAS created on the same project as AI EasyMaker is available to use.
[Note] Time to create notebooks: Notebooks can take several minutes to create. Creating the initial resources (notebooks, training, experiments, endpoints) takes an additional few minutes to configure the service environment.
A list of notebooks is displayed. Select a notebook in the list to check details and make changes to it.
Status: Status of the notebook is displayed. Please refer to the table below for the main status.
Status | Description |
---|---|
CREATE REQUESTED | Notebook creation is requested. |
CREATE IN PROGRESS | Notebook instance is in the process of creation. |
ACTIVE (HEALTHY) | Notebook application is in normal operation. |
ACTIVE (UNHEALTHY) | Notebook application is not operating properly. If this condition persists after restarting the notebook, please contact customer service center. |
STOP IN PROGRESS | Notebook stop in progress. |
STOPPED | Notebook stopped. |
START IN PROGRESS | Notebook start in progress. |
REBOOT IN PROGRESS | Notebook reboot in progress. |
DELETE IN PROGRESS | Notebook delete in progress. |
CREATE FAILED | Failed to create notebook. If creation keeps failing, please contact the customer service center. |
STOP FAILED | Failed to stop notebook. Please try to stop again. |
START FAILED | Failed to start notebook. Please try to start again. |
REBOOT FAILED | Failed to reboot notebook. Please try to reboot again. |
DELETE FAILED | Failed to delete notebook. Please try to delete again. |
Action > Open Jupyter Notebook: Click Open Jupyter Notebook button to open the notebook in a new browser window. The notebook is only accessible to users who are logged in to the console.
Tag: Tag for notebook is displayed. You can change the tag by clicking Change.
Monitoring: On the Monitoring tab of the detail screen that appears when you select the notebook, you can see a list of monitored instances and a chart of basic metrics.
AI EasyMaker notebook instance provides native Conda virtual environment with various libraries and kernels required for machine learning.
The default Conda virtual environments are initialized and run when the notebook is stopped and started, but virtual environments and external libraries that the user installs in arbitrary paths are not automatically initialized and are not retained when the notebook is stopped and started.
To resolve this issue, create a virtual environment in the /root/easymaker/custom-conda-envs directory path and install external libraries in the created virtual environment. AI EasyMaker initializes and runs virtual environments created in the /root/easymaker/custom-conda-envs directory path when the notebook is stopped and started.
Please refer to the following guide to configure your virtual environment.
Go to the /root/easymaker/custom-conda-envs path.
cd /root/easymaker/custom-conda-envs
To create a virtual environment called easymaker_env with Python 3.8, run the conda create command as follows:
conda create --prefix ./easymaker_env python=3.8
The created virtual environment can be checked with the conda env list command.
(base) root@nb-xxxxxx-0:~# conda env list
# conda environments:
#
/opt/intel/oneapi/intelpython/latest
/opt/intel/oneapi/intelpython/latest/envs/2022.2.1
base * /opt/miniconda3
easymaker_env /root/easymaker/custom-conda-envs/easymaker_env
You can register scripts in the path /root/easymaker/cont-init.d that should run automatically when the notebook is stopped and started.
Execution order: The scripts are executed in ascending alphanumeric order.
Execution target: Only scripts under /root/easymaker/cont-init.d that start with #! are executed.
Execution records: The exit code of each script is recorded in /root/easymaker/cont-init.d/{SCRIPT}.exitcode, the output of each script in /root/easymaker/cont-init.d/{SCRIPT}.output, and the combined output of all scripts in /root/easymaker/cont-init.output.
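For example, the following is a minimal sketch of such a script; the file name and installed package are hypothetical, and any executable script beginning with #! can be registered.

#!/usr/bin/env python3
# Hypothetical script: /root/easymaker/cont-init.d/01-install-packages
# Runs automatically when the notebook is stopped and started; its exit code
# and output are recorded in the files described above.
import subprocess
import sys

# Reinstall a package outside /root/easymaker that would otherwise be lost.
result = subprocess.run([sys.executable, "-m", "pip", "install", "tqdm"])
sys.exit(result.returncode)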
Stop the running notebook or start the stopped notebook.
[Caution] How to retain your virtual environment and external libraries when starting the notebook after stopping it: When the notebook is stopped and started, the virtual environment and external libraries that the user created can be initialized. To retain them, configure your virtual environment by referring to User Virtual Execution Environment Configuration.
[Note] Time to start and stop notebooks: It may take several minutes to start and stop notebooks.
Change the instance flavor of the created notebook. The instance flavor can only be changed to a flavor with the same core type as the existing instance.
[Note] Time to change instance flavors: It may take several minutes to change the instance flavor.
If a problem occurs while using the notebook, or if the status is ACTIVE but you can't access the notebook, you can reboot the notebook.
[Caution] How to retain your virtual environment and external libraries when rebooting the notebook: When the notebook is rebooted, the virtual environment and external libraries that the user created can be initialized. To retain them, configure your virtual environment by referring to User Virtual Execution Environment Configuration.
Delete the created notebook.
[Note] Storage: When a notebook is deleted, its boot storage and data storage are deleted. Connected NHN Cloud NAS is not deleted and must be deleted individually from NHN Cloud NAS.
Group related trainings into an experiment to manage them together.
[Note] Experiment creation time: Creating an experiment can take several minutes. Creating the initial resources (notebooks, trainings, experiments, endpoints) takes an additional few minutes to configure the service environment.
A list of experiments is displayed. Select an experiment to view and modify detailed information.
Status: Experiment status appears. Please refer to the table below for main status.
Status | Description |
---|---|
CREATE REQUESTED | Creating an experiment is requested. |
CREATE IN PROGRESS | An experiment is being created. |
CREATE FAILED | Failed to create an experiment. Please try again. |
ACTIVE | The experiment is successfully created. |
Operation
Delete an experiment.
[Note] Unable to delete an experiment if an associated resource exists: An experiment cannot be deleted if a training, hyperparameter tuning, pipeline run, or pipeline schedule associated with the experiment exists. Please delete the associated resources first, then delete the experiment. You can check the list of associated resources by clicking the [Training] tab in the detail screen at the bottom that is displayed when you click the experiment you want to delete.
Provides a training environment where you can train machine learning algorithms and verify the results.
Set up the training environment by selecting the instance and OS image to be used for training, then start training by entering the algorithm information and the input/output data paths.
Algorithm information: Enter information about the algorithm you want to train.
Own Algorithm: Uses an algorithm written by the user.
Algorithm path
Entry point
Image : Choose an image for your instance that matches the environment in which you need to run your training.
Training Resource Information
[Caution] When using NHN Cloud NAS: Only NHN Cloud NAS created in the same project as AI EasyMaker can be used.
[Caution] Training failure when deleting training input data: Training may fail if the input data is deleted before training is completed.
A list of trainings is displayed. If you select a training from the list, you can check detailed information and change it.
Status : Shows the status of training. Please refer to the table below for the main status.
Status | Description |
---|---|
CREATE REQUESTED | Training creation is requested. |
CREATE IN PROGRESS | Resources necessary for training are being created. |
RUNNING | Training is in progress. |
STOPPED | Training is stopped at the user's request. |
COMPLETE | Training has been completed normally. |
STOP IN PROGRESS | Training is stopping. |
FAIL TRAIN | Training has failed. Detailed failure information can be checked through the Log & Crash Search log when log management is enabled. |
CREATE FAILED | The training creation failed. If creation continues to fail, please contact customer service. |
FAIL TRAIN IN PROGRESS, COMPLETE IN PROGRESS | The resources used for training are being cleaned up. |
Operation
Hyperparameters : You can check the hyperparameter values set for training on the hyperparameter tab of the detailed screen displayed when selecting training.
Monitoring: On the Monitoring tab of the detail screen that appears when you select the training, you can see a list of monitored instances and basic metrics charts.
Create a new training with the same settings as an existing training.
Create a model from a training in the COMPLETE state.
Deletes a training.
[Note] Training cannot be deleted if a related model exists: Training cannot be deleted if a model created by the training to be deleted exists. Please delete the model first and then the training.
Hyperparameter tuning is the process of optimizing hyperparameter values to maximize a model's predictive accuracy. If you don't use this feature, you'll have to manually tune the hyperparameters to find the optimal values while running many training jobs yourself.
How to configure a hyperparameter tuning job.
[Caution] When using NHN Cloud NAS: Only NHN Cloud NAS created in the same project as AI EasyMaker can be used.
[Caution] Training failure when deleting training input data: Training may fail if the input data is deleted before training is completed.
A list of hyperparameter tunings is displayed. Select a hyperparameter tuning from the list to view details and change information.
Status : Shows the status of hyperparameter tuning. Please refer to the table below for the main status.
Status | Description |
---|---|
CREATE REQUESTED | Requested to create hyperparameter tuning. |
CREATE IN PROGRESS | Resources required for hyperparameter tuning are being created. |
RUNNING | Hyperparameter tuning is in progress. |
STOPPED | Hyperparameter tuning is stopped at the user's request. |
COMPLETE | Hyperparameter tuning has been successfully completed. |
STOP IN PROGRESS | Hyperparameter tuning is stopping. |
FAIL HYPERPARAMETER TUNING | Hyperparameter tuning has failed. Detailed failure information can be checked through the Log & Crash Search log when log management is enabled. |
CREATE FAILED | Hyperparameter tuning creation failed. If creation continues to fail, please contact customer service. |
FAIL HYPERPARAMETER TUNING IN PROGRESS, COMPLETE IN PROGRESS, STOP IN PROGRESS | Resources used for hyperparameter tuning are being cleaned up. |
Status Details: The bracketed content in the COMPLETE status is the status details. See the table below for key details.
Details | Description |
---|---|
GoalReached | Details when training for hyperparameter tuning is complete by reaching the target value. |
MaxTrialsReached | Details when hyperparameter tuning has reached the maximum number of training runs and is complete. |
SuggestionEndReached | Details when the exploration algorithm in Hyperparameter Tuning has explored all hyperparameters. |
Operation
Monitoring: When you select hyperparameter tuning, you can check the list of monitored instances and basic indicator charts in the Monitoring tab of the detailed screen that appears.
Displays a list of trainings auto-generated by hyperparameter tuning. Select a training from the list to check detailed information.
Status : Shows the status of the training automatically generated by hyperparameter tuning. Please refer to the table below for the main status.
Status | Description |
---|---|
CREATED | Training has been created. |
RUNNING | Training is in progress. |
SUCCEEDED | Training has been completed normally. |
KILLED | Training is stopped by the system. |
FAILED | Training has failed. Detailed failure information can be checked through the Log & Crash Search log when log management is enabled. |
METRICS_UNAVAILABLE | Target metrics cannot be collected. |
EARLY_STOPPED | Training was stopped early because the goal metric was no longer improving. |
Create a new hyperparameter tuning with the same settings as the existing hyperparameter tuning.
Create a model from the best training of a hyperparameter tuning in the COMPLETE state.
Delete a hyperparameter tuning.
[Note] Hyperparameter tuning cannot be deleted if the associated model exists: Hyperparameter tuning cannot be deleted if the model created by the hyperparameter tuning you want to delete exists. Please delete the model first, then the hyperparameter tuning.
By creating a training template in advance, you can import the values entered into the template when creating training or hyperparameter tuning.
For information on what you can set in your training template, see Creating a training.
Displays a list of training templates. Select a training template from the list to view details and change information.
Create a new training template with the same settings as an existing training template.
Delete the training template.
You can manage models from AI EasyMaker training results or external models as artifacts.
For NHN Cloud Object Storage, enter the artifact path in the format obs://{Object Storage API endpoint}/{containerName}/{path}.
For NHN Cloud NAS, enter the artifact path in the format nas://{NAS ID}:/{path}.
[Caution] When using NHN Cloud NAS: Only NHN Cloud NAS created on the same project as AI EasyMaker is available to use.
[Caution] Retain model artifacts in storage: If the model artifacts stored in storage are not retained, the creation of endpoints for that model fails.
[Note] Model Parameter: The values entered as model parameters are used when serving the model. Parameters can be used as arguments and environment variables: arguments use the parameter name as entered, and environment variables use the parameter name converted to SCREAMING_SNAKE_CASE.
[Note] When creating a HuggingFace model: When creating a HuggingFace model, you can create the model by entering the ID of the HuggingFace model as a parameter. The ID of the HuggingFace model can be found in the URL of the HuggingFace model page. For more information, see Appendix > 11. Framework-specific serving notes.
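For illustration, if you entered a model parameter named model_id (a hypothetical name), a serving script could read it either way:

import argparse
import os

# A parameter entered as "model_id" arrives as the argument --model_id
# and as the environment variable MODEL_ID (SCREAMING_SNAKE_CASE).
parser = argparse.ArgumentParser()
parser.add_argument("--model_id")
args, _ = parser.parse_known_args()

model_id = args.model_id or os.environ.get("MODEL_ID")
print(model_id)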
Model list is displayed. Selecting a model in the list allows to check detailed information and make changes to it.
Create an endpoint that can serve the selected model.
Create batch inferences with the selected model and view the inference results as statistics.
Delete a model.
[Note] Unable to delete a model if an associated endpoint exists: You cannot delete a model if an endpoint created from that model exists. To delete it, first delete the endpoint created from the model, then delete the model.
Create and manage endpoints that can serve the model.
For example, if you set the API Gateway resource path to /inference, you can request the inference API with POST https://{endpoint-domain}/inference.
[Note] API Specification for Inference Request: The AI EasyMaker service provides endpoints based on the Open Inference Protocol (OIP) specification. For the endpoint API specification, see Appendix > 10. Endpoint API specification. To use a separate endpoint, refer to the resources created in the API Gateway service and create a new resource to use it. For more information about the OIP specification, see OIP specification.
[Note] Time to create endpoints: Endpoint creation can take several minutes. Creating the initial resources (notebooks, training, experiments, endpoints) takes an additional few minutes to configure the service environment.
[Note] Restrictions on API Gateway service resource provision when creating endpoints: When you create a new endpoint, a new API Gateway service is created. Adding a new stage to an existing endpoint creates a new stage in the API Gateway service. If you exceed the default quota in the API Gateway Service Resource Provision Policy, you might not be able to create endpoints in AI EasyMaker. In this case, adjust the API Gateway service resource quota.
Endpoints list is displayed. Select an endpoint in the list to check details and make changes to the information.
Status: Status of endpoint. Please refer to the table below for main status.
Status | Description |
---|---|
CREATE REQUESTED | Endpoint creation is requested. |
CREATE IN PROGRESS | Endpoint creation is in progress. |
UPDATE IN PROGRESS | Some of the endpoint's stages have tasks in progress. You can check the task status of each stage in the endpoint stage list. |
DELETE IN PROGRESS | Endpoint deletion is in progress. |
ACTIVE | Endpoint is in normal operation. |
CREATE FAILED | Endpoint creation has failed. You must delete and recreate the endpoint. If the creation fails repeatedly, please contact the Customer Center. |
UPDATE FAILED | Some of endpoint stages are not serviced properly. You must delete and recreate the stages with issues. |
API Gateway Status: Displays API Gateway status information for default stage of endpoint. Please refer to the table below for main status.
Status | Description |
---|---|
CREATE IN PROGRESS | API Gateway Resource creation in progress. |
STAGE DEPLOYING | API Gateway default stage deploying in progress. |
ACTIVE | API Gateway default stage is successfully deployed and activated. |
NOT FOUND: STAGE | The default stage for the endpoint is not found. Please check whether the stage exists in the API Gateway console. A deleted API Gateway stage cannot be recovered, and the endpoint has to be deleted and recreated. |
NOT FOUND: STAGE DEPLOY RESULT | The deployment status of the endpoint default stage is not found. Please check if the default stage is deployed in API Gateway console. |
STAGE DEPLOY FAIL | The API Gateway default stage has failed to deploy. [Note] Refer to Recovery method when the stage's API Gateway is in 'Deployment Failed' status to recover from the failed deployment. |
Add a new stage to an existing endpoint. You can create and test the new stage without affecting the default stage.
The list of stages created under the endpoint is displayed. Select a stage in the list to check its details.
Status: Displays status of endpoint stage. Please refer to the table below for main status.
Status | Description |
---|---|
CREATE REQUESTED | Endpoint stage creation requested. |
CREATE IN PROGRESS | Endpoint stage creation is in progress. |
DEPLOY IN PROGRESS | Model deployment to the endpoint stage is in progress. |
DELETE IN PROGRESS | Endpoint stage deletion is in progress. |
ACTIVE | Endpoint stage is in normal operation. |
CREATE FAILED | Endpoint stage creation has failed. Please try again. |
DEPLOY FAILED | Deployment to the endpoint stage has failed. Please try again. |
API Gateway Status: Displays stage status of API Gateway from where endpoint stage is deployed.
[Caution] Precautions when changing settings for API Gateway created by AI EasyMaker: When creating an endpoint or an endpoint stage, AI EasyMaker creates API Gateway services and stages for the endpoint. Please note the following precautions when changing API Gateway services and stages created by AI EasyMaker directly from API Gateway service console.
- Avoid deleting API Gateway services and stages created by AI EasyMaker. Deletion may prevent the endpoint from displaying API Gateway information correctly, and changes made to endpoint may not be applied to API Gateway.
- Avoid changing or deleting resources in the API Gateway resource path that was entered when creating endpoints. Deletion may cause the endpoint's inference API call to fail.
- Avoid adding resources in API Gateway resource path that was entered when creating endpoints. The added resources may be deleted when adding or changing endpoint stages.
- In the stage settings of API Gateway, do not disable Backend Endpoint Url Redefinition or change the URL set in the API Gateway resource path. If you change the URL, the endpoint's inference API call might fail. Apart from the above precautions, other settings can be used as necessary with the features provided by API Gateway. For more information about how to use API Gateway, refer to the API Gateway Console Guide.
[Note] Recovery method when the stage's API Gateway is in 'Deployment Failed' status: If stage settings of an AI EasyMaker endpoint are not deployed to the API Gateway stage due to a temporary issue, the deployment status is displayed as failed. In this case, you can deploy the API Gateway stage manually by selecting the stage from the stage list and clicking View API Gateway Settings > Deploy Stage in the bottom detail screen. If the deployment status is not recovered by this guide, please contact the Customer Center.
Add a new resource to an existing endpoint stage.
A list of resources created under the endpoint stage is displayed.
Status : Shows the status of stage resource. Please refer to the table below for the main status.
Status | Description |
---|---|
CREATE REQUESTED | Creating stage resource requested. |
CREATE IN PROGRESS | Stage resource is being created. |
DELETE IN PROGRESS | Stage resource is being deleted. |
ACTIVE | Stage resource is deployed normally. |
CREATE FAILED | Creating stage resource failed. Please try again. |
Model Name: The name of the model deployed to the stage.
// Inference API example: Request
curl --location --request POST '{API Gateway Resource Path}' \
--header 'Content-Type: application/json' \
--data-raw '{
"instances": [
[6.8, 2.8, 4.8, 1.4],
[6.0, 3.4, 4.5, 1.6]
]
}'
// Inference API Example: Response
{
"predictions" : [
[
0.337502569,
0.332836747,
0.329660654
],
[
0.337530434,
0.332806051,
0.329663515
]
]
}
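The same request can also be sent from Python. Below is a minimal sketch using the requests package, with the endpoint domain and resource path from the example above kept as placeholders:

import requests

# Replace with your endpoint domain and API Gateway resource path.
url = "https://{endpoint-domain}/inference"

payload = {
    "instances": [
        [6.8, 2.8, 4.8, 1.4],
        [6.0, 3.4, 4.5, 1.6],
    ]
}

response = requests.post(url, json=payload, timeout=30)
response.raise_for_status()
print(response.json())  # {"predictions": [[...], [...]]}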
Change the default stage of the endpoint to another stage. To change the model of an endpoint without service interruption, AI EasyMaker recommends deploying the model using the stage feature.
[Caution] The API Gateway stage is deleted when the endpoint stage is deleted: Deleting an endpoint stage in AI EasyMaker also deletes the stage in the API Gateway service from which the endpoint's stage is deployed. If there is an API running on the API Gateway stage to be deleted, please note that API calls can no longer be made.
Delete an endpoint.
[Caution] The API Gateway service is deleted when the endpoint is deleted: Deleting an endpoint in AI EasyMaker also deletes the API Gateway service from which the endpoint's stages were deployed. If there is an API running on the API Gateway service to be deleted, please note that API calls can no longer be made.
Provides an environment to make batch inferences from an AI EasyMaker model and view inference results in statistics.
Set up the environment in which batch inference will be performed by selecting an instance and OS image, and enter the paths to the input/output data to be inferred to proceed with batch inference.
[Caution] When using NHN Cloud NAS: Only NHN Cloud NAS created on the same project as AI EasyMaker is available to use.
[Caution] Batch inference fails when batch inference input data is deleted: Batch inference can fail if you delete input data before batch inference is complete.
[Caution] When setting input data detailed options: If the Glob pattern is not entered properly, batch inference may not work properly because the input data cannot be found. When used together with the Include Glob pattern, the Exclude Glob pattern takes precedence.
[Caution] When setting batch options: You must set the batch size and inference timeout appropriately based on the performance of the model you are batch inferring. If the settings you enter are incorrect, batch inference might not perform well enough.
[Caution] When using GPU instances: Batch inference using GPU instances allocates GPUs based on the number of pods. If the number of GPUs is not evenly divisible by the number of pods, some GPUs are left unallocated. Unallocated GPUs are not used by batch inference, so set the number of pods appropriately to use GPU instances efficiently.
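For example, with hypothetical numbers:

# Hypothetical: instances with 8 GPUs in total, 3 pods configured.
gpu_count = 8
pod_count = 3

gpus_per_pod = gpu_count // pod_count               # 2 GPUs per pod
unallocated = gpu_count - gpus_per_pod * pod_count  # 2 GPUs stay idle
print(gpus_per_pod, unallocated)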
Displays a list of batch inferences. Select a batch inference from the list to check the details and change the information.
Status : Displays the status of batch inference. Please refer to the table below for the main status.
Status | Description |
---|---|
CREATE REQUESTED | Batch inference creation is requested. |
CREATE IN PROGRESS | Resources necessary for batch inference are being created. |
RUNNING | Batch inference is in progress. |
STOPPED | Batch inference is stopped at the user's request. |
COMPLETE | Batch inference has been completed successfully. |
STOP IN PROGRESS | Batch inference is stopping. |
FAIL BATCH INFERENCE | Batch inference has failed. Detailed failure information can be checked through the Log & Crash Search log when log management is enabled. |
CREATE FAILED | Batch inference creation failed. If creation continues to fail, please contact customer service. |
FAIL BATCH INFERENCE IN PROGRESS, COMPLETE IN PROGRESS | The resources used for batch inference are being cleaned up. |
Operation
Create a new batch inference with the same settings as an existing batch inference.
Delete a batch inference.
User-personalized container images can be used to drive notebooks, training, and hyperparameter tuning. Only private images derived from the notebook/deep learning images provided by AI EasyMaker can be used when creating resources in AI EasyMaker. See the table below for the base images in AI EasyMaker.
Image Name | CoreType | Framework | Framework version | Python version | Image address |
---|---|---|---|---|---|
Ubuntu 22.04 CPU Python Notebook | CPU | Python | 3.10.12 | 3.10 | fb34a0a4-kr1-registry.container.nhncloud.com/easymaker/python-notebook:3.10.12-cpu-py310-ubuntu2204 |
Ubuntu 22.04 GPU Python Notebook | GPU | Python | 3.10.12 | 3.10 | fb34a0a4-kr1-registry.container.nhncloud.com/easymaker/python-notebook:3.10.12-gpu-py310-ubuntu2204 |
Ubuntu 22.04 CPU PyTorch Notebook | CPU | PyTorch | 2.0.1 | 3.10 | fb34a0a4-kr1-registry.container.nhncloud.com/easymaker/pytorch-notebook:2.0.1-cpu-py310-ubuntu2204 |
Ubuntu 22.04 GPU PyTorch Notebook | GPU | PyTorch | 2.0.1 | 3.10 | fb34a0a4-kr1-registry.container.nhncloud.com/easymaker/pytorch-notebook:2.0.1-gpu-py310-ubuntu2204 |
Ubuntu 22.04 CPU TensorFlow Notebook | CPU | TensorFlow | 2.12.0 | 3.10 | fb34a0a4-kr1-registry.container.nhncloud.com/easymaker/tensorflow-notebook:2.12.0-cpu-py310-ubuntu2204 |
Ubuntu 22.04 GPU TensorFlow Notebook | GPU | TensorFlow | 2.12.0 | 3.10 | fb34a0a4-kr1-registry.container.nhncloud.com/easymaker/tensorflow-notebook:2.12.0-gpu-py310-ubuntu2204 |
Image Name | CoreType | Framework | Framework version | Python version | Image address |
---|---|---|---|---|---|
Ubuntu 22.04 CPU PyTorch Training | CPU | PyTorch | 2.0.1 | 3.10 | fb34a0a4-kr1-registry.container.nhncloud.com/easymaker/pytorch-train:2.0.1-cpu-py310-ubuntu2204 |
Ubuntu 22.04 GPU PyTorch Training | GPU | PyTorch | 2.0.1 | 3.10 | fb34a0a4-kr1-registry.container.nhncloud.com/easymaker/pytorch-train:2.0.1-gpu-py310-ubuntu2204 |
Ubuntu 22.04 CPU TensorFlow Training | CPU | TensorFlow | 2.12.0 | 3.10 | fb34a0a4-kr1-registry.container.nhncloud.com/easymaker/tensorflow-train:2.12.0-cpu-py310-ubuntu2204 |
Ubuntu 22.04 GPU TensorFlow Training | GPU | TensorFlow | 2.12.0 | 3.10 | fb34a0a4-kr1-registry.container.nhncloud.com/easymaker/tensorflow-train:2.12.0-gpu-py310-ubuntu2204 |
[Note] Limitations on using private images:
Only private images derived from base images provided by AI EasyMaker can be used. Only NHN Container Registry (NCR) can be integrated as the container registry service where private images are stored (as of December 2023).
The following explains how to create a container image based on an AI EasyMaker base image using Docker, and how to use the private image for notebooks in AI EasyMaker.
Create a Dockerfile for the private image.
FROM fb34a0a4-kr1-registry.container.nhncloud.com/easymaker/python-notebook:3.10.12-cpu-py310-ubuntu2204 AS easymaker-notebook
RUN conda create -n example python=3.10
# conda activate does not persist across RUN steps; install into the environment directly
RUN conda run -n example pip install torch torchvision
Build a private image and push it to the container registry: build the image with the Dockerfile and save (push) it to the NCR registry.
docker build -t {image name}:{tag} .
docker tag {image name}:{tag} {NCR registry address}/{image name}:{tag}
docker push {NCR registry address}/{image name}:{tag}
(Example)
docker build -t custom-training:v1 .
docker tag custom-training:v1 example-kr1-registry.container.nhncloud.com/registry/custom-training:v1
docker push example-kr1-registry.container.nhncloud.com/registry/custom-training:v1
Create a private image in AI EasyMaker of the image you saved (pushed) to the NCR.
Create a notebook with the private image you created.
[Note] Where to use private images: Private images can be used for notebooks, training, and hyperparameter tuning to create resources.
[Note] Container registry service: Only the NHN Container Registry (NCR) service can be used as a container registry service (as of December 2023). Enter the following values for the account ID and password of the NCR service. ID: User Access Key of the NHN Cloud user account. Password: User Secret Key of the NHN Cloud user account.
In order for AI EasyMaker to pull an image from a user's registry where private images are stored and run the container, it needs to be logged in to the user's registry. If you save your login information as a registry account, you can reuse it for images linked to that registry account. To manage your registry accounts, go to the Image menu in the AI EasyMaker console, then select the Registry Account tab.
Create a new registry account.
[Note] When you change a registry account, images associated with that account sign in to the registry service with the changed username and password. If you enter an incorrect registry username or password, the login during a private image pull fails and resource creation fails. You cannot modify a registry account if resources are being created with a private image associated with it, or if trainings or hyperparameter tunings using it are in progress.
Select the registry account you want to delete from the list, and click Delete Registry Account.
[Note] You cannot delete a registry account associated with an image. To delete, delete the associated image first and then delete the registry account.
ML Pipeline is a feature for managing and executing portable and scalable machine learning workflows. You can use the Kubeflow Pipelines (KFP) Python SDK to write components and pipelines, compile pipelines into intermediate representation YAML, and run them in AI EasyMaker.
[Note] What is a pipeline? A pipeline is a definition of a workflow that combines one or more components to form a directed acyclic graph (DAG). Each component runs a single container during execution, which can generate ML artifacts.
[Note] What are ML artifacts? Components can take inputs and produce outputs. There are two I/O types, parameters and artifacts: 1. Parameters are useful for passing small amounts of data between components. 2. Artifact types are for ML artifact outputs, such as datasets, models, and metrics, and provide a convenient mechanism for saving to object storage.
Most pipelines aim to produce one or more ML artifacts, such as datasets, models, evaluation metrics, etc.
[Reference] Kubeflow Pipelines (KFP) official documentation
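For illustration, a minimal pipeline written with the KFP Python SDK and compiled to the intermediate representation YAML might look like the following; the component and pipeline names are hypothetical.

from kfp import compiler, dsl

@dsl.component
def add(a: float, b: float) -> float:
    # A component runs as a single container during execution.
    return a + b

@dsl.pipeline(name="example-pipeline")
def example_pipeline(a: float = 1.0, b: float = 2.0) -> float:
    # Chain two components into a DAG: the output of the first feeds the second.
    first = add(a=a, b=b)
    second = add(a=first.output, b=a)
    return second.output

# Compile to the IR YAML that can be uploaded to AI EasyMaker.
compiler.Compiler().compile(example_pipeline, "example_pipeline.yaml")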
Upload a pipeline.
[Note] Pipeline upload time: Uploading a pipeline can take a few minutes. The initial resource creation requires an additional few minutes of time to configure the service environment.
A list of pipelines is displayed. Select a pipeline in the list to view details and make changes to the information.
Status: The status of the pipeline is displayed. See the table below for key statuses.
Status | Description |
---|---|
CREATE REQUESTED | Pipeline creation has been requested. |
CREATE IN PROGRESS | Pipeline creation is in progress. |
CREATE FAILED | Pipeline creation failed. Try again. |
ACTIVE | The pipeline was created successfully. |
A pipeline graph is displayed. Select a node in the graph to see more information.
A graph is a pictorial representation of a pipeline. Each node in the graph represents a step in the pipeline, with arrows indicating the parent/child relationship between the pipeline components represented by each step.
Delete the pipeline.
[Note] Cannot delete a pipeline if an associated pipeline schedule exists: You cannot delete a pipeline if a schedule created with the pipeline you want to delete exists. Delete the pipeline schedule first, then delete the pipeline.
You can run and manage your uploaded pipelines in AI EasyMaker.
Run the pipeline.
For NHN Cloud NAS, enter the path in the format nas://{NAS ID}:/{path}.
[Caution] If you are using NHN Cloud NAS: Only NHN Cloud NAS created in the same project as AI EasyMaker is available.
[Note] Pipeline run generation time: Creating a pipeline run can take a few minutes. The initial resource creation requires an additional few minutes of time to configure the service environment.
A list of pipeline runs is displayed. Select a pipeline run in the list to view details and make changes to the information.
Status: The status of the pipeline execution is displayed. See the table below for key statuses.
Status | Description |
---|---|
CREATE REQUESTED | Pipeline run creation is requested. |
CREATE IN PROGRESS | Pipeline run creation is in progress. |
CREATE FAILED | Pipeline run creation failed. Try again. |
RUNNING | Pipeline execution is in progress. |
COMPLETE IN PROGRESS | The resources used to run the pipeline are being cleaned up. |
COMPLETE | The pipeline execution has completed successfully. |
STOP IN PROGRESS | The pipeline run is stopping. |
STOPPED | The pipeline execution has been stopped at the user's request. |
FAIL PIPELINE RUN IN PROGRESS | The resources used to run the pipeline are being cleaned up. |
FAIL PIPELINE RUN | The pipeline execution has failed. Detailed failure information can be found in the Log & Crash Search log when log management is enabled. |
Operation
A graph of the pipeline run is displayed. Select a node in the graph to see more information.
The graph is a pictorial representation of the pipeline execution. This graph shows the steps that have already been executed and the steps that are currently executing during pipeline execution, with arrows indicating the parent/child relationship between the pipeline components represented by each step. Each node in the graph represents a step in the pipeline.
With node-specific details, you can download the generated artifacts.
[Caution] Pipeline artifact storage cycle: Artifacts older than 120 days are automatically deleted.
Stop running pipelines in progress.
[Note] How long it takes to stop running a pipeline: Stopping pipeline execution can take a few minutes.
Create a new pipeline run with the same settings as an existing pipeline run.
Delete a pipeline run.
You can create and manage a recurring run to periodically run the uploaded pipeline repeatedly in AI EasyMaker.
Create a recurring run to run the pipeline in periodic iterations.
For the settings available when creating a pipeline schedule beyond the items below, see Create Recurring Run.
[Note] How long it takes to create a pipeline recurring run: Creating a recurring run can take a few minutes. The initial resource creation requires an additional few minutes of time to configure the service environment.
[Note] Cron expression format: The Cron expression uses six space-separated fields to represent the time. For more information, see the Cron Expression Format documentation.
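For example, assuming the six fields are ordered second, minute, hour, day of month, month, and day of week, the following expressions are valid schedules:

# second minute hour day-of-month month day-of-week
# Every day at 03:00:00:
0 0 3 * * *
# Every 30 minutes:
0 */30 * * * *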
A list of pipeline schedules is displayed. Select a pipeline recurring run in the list to view details and make changes to the information.
Status: The status of the pipeline recurring run is displayed. See the table below for key statuses.
Status | Description |
---|---|
CREATE REQUESTED | Pipeline recurring run creation has been requested. |
CREATE FAILED | Pipeline recurring run creation failed. Try again. |
ENABLED | The pipeline recurring run has started normally. |
ENABLED(EXPIRED) | The pipeline recurring run started successfully but has passed the end time you set. |
DISABLED | The pipeline recurring run has been stopped at the user's request. |
Manage Run: When you select a pipeline recurring run in the list, you can view the list of runs generated by the pipeline recurring run on the Manage Run tab of the detail screen that appears.
Stop a started pipeline recurring run or start a stopped pipeline recurring run.
Create a new pipeline recurring run with the same settings as an existing pipeline recurring run.
Delete a pipeline recurring run.
[Note] You cannot delete a pipeline schedule if an associated pipeline run is in progress: You cannot delete a pipeline schedule while a run generated by it is in progress. Delete the pipeline schedule after the pipeline run is complete.
Some features of AI EasyMaker use the user's NHN Cloud Object Storage as input/output storage. For these features to work normally, you must allow read or write access for the AI EasyMaker system account on the user's NHN Cloud Object Storage container.
Allowing read/write permissions for the AI EasyMaker system account on the user's NHN Cloud Object Storage container means that the AI EasyMaker system account can read or write all files in that container, in accordance with the granted permissions.
Check this information and grant access in your Object Storage access policy only to the required accounts with the required permissions.
The user takes responsibility for all consequences of granting Object Storage access to an account other than the AI EasyMaker system account during the access policy setup, and AI EasyMaker is not responsible for them.
[Note] According to features, AI EasyMaker accesses, reads or writes to Object Storage as follows.
Feature | Access Right | Access target |
---|---|---|
Training | Read | Algorithm path entered by user, training input data path |
Training | Write | User-entered training output data, checkpoint path |
Model | Read | Model artifact path entered by user |
Endpoint | Read | Model artifact path entered by user |
To add read/write permissions to AI EasyMaker system account in Object Storage, refer to the following:
Logs and events generated by the AI EasyMaker service can be stored in the NHN Cloud Log & Crash Search service. To store logs in the Log & Crash Search service, you have to enable the Log & Crash Search service, and a separate usage fee is charged.
The AI EasyMaker service sends logs to the Log & Crash Search service with the following defined fields:
Common Log Field
Name | Description | Valid range |
---|---|---|
easymakerAppKey | AI EasyMaker appkey | - |
category | Log category | easymaker.training, easymaker.inference |
logLevel | Log level | INFO, WARNING, ERROR |
body | Log contents | - |
logType | Service name provided by log | NHNCloud-AIEasyMaker |
time | Log Occurrence Time (UTC Time) | - |
Training Log Field
Name | Description |
---|---|
trainingId | AI EasyMaker training ID |
Hyperparameter Tuning Log Field
Name | Description |
---|---|
hyperparameterTuningId | AI EasyMaker hyperparameter tuning ID |
Endpoint Log Field
Name | Description |
---|---|
endpointId | AI EasyMaker Endpoint ID |
endpointStageId | Endpoint stage ID |
inferenceId | Unique ID of the inference request |
action | Action classification (Endpoint.Model) |
modelName | Model name to be inferred |
Batch Inference Log Field
Name | Description |
---|---|
batchInferenceId | AI EasyMaker batch inference ID |
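For illustration, a training log entry combining the common and training fields might look like the following; all values are hypothetical.

{
  "easymakerAppKey": "EXAMPLE_APP_KEY",
  "category": "easymaker.training",
  "logLevel": "INFO",
  "body": "Epoch 10/500 completed",
  "logType": "NHNCloud-AIEasyMaker",
  "time": "2024-01-01T00:00:00Z",
  "trainingId": "training-example-id"
}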
As shown in the example below, you can use hyperparameter values entered during training creation.
import argparse
import os

# Hyperparameter values are also exposed as environment variables (EM_HP_*)
model_version = os.environ.get("EM_HP_MODEL_VERSION")

def parse_hyperparameters():
    parser = argparse.ArgumentParser()
    # Parse the hyperparameters entered at training creation
    parser.add_argument("--epochs", type=int, default=500)
    parser.add_argument("--batch_size", type=int, default=32)
    ...
    return parser.parse_known_args()
Key Environment Variables
Environment variable name | Description |
---|---|
EM_SOURCE_DIR | Absolute path to the folder where the algorithm script entered at the time of training creation is downloaded |
EM_ENTRY_POINT | Algorithm entry point name entered at training creation |
EM_DATASET_${Data set name} | Absolute path to the folder where each data set entered at the time of training creation is downloaded |
EM_DATASETS | Full data set list ( json format) |
EM_MODEL_DIR | Model storage path |
EM_CHECKPOINT_INPUT_DIR | Input checkpoint storage path |
EM_CHECKPOINT_DIR | Checkpoint Storage Path |
EM_HP_${ Upper case converted Hyperparameter key } | Hyperparameter value corresponding to the hyperparameter key |
EM_HPS | Full Hyperparameter List (in json format) |
EM_TENSORBOARD_LOG_DIR | TensorBoard log path for checking training results |
EM_REGION | Current Region Information |
EM_APPKEY | Appkey of AI EasyMaker service currently in use |
Example code for utilizing environment variables
import os
import tensorflow
dataset_dir = os.environ.get("EM_DATASET_TRAIN")
train_data = read_data(dataset_dir, "train.csv")
model = ... # Implement the model using input data
model.load_weights(os.environ.get('EM_CHECKPOINT_INPUT_DIR', None))
callbacks = [
tensorflow.keras.callbacks.ModelCheckpoint(filepath=f'{os.environ.get("EM_CHECKPOINT_DIR")}/cp-{{epoch:04d}}.ckpt', save_freq='epoch', period=50),
tensorflow.keras.callbacks.TensorBoard(log_dir=f'{os.environ.get("EM_TENSORBOARD_LOG_DIR")}'),
]
model.fit(..., callbacks=callbacks)
model_dir = os.environ.get("EM_MODEL_DIR")
model.save(model_dir)
To check training results in TensorBoard, save TensorBoard logs to the designated path (EM_TENSORBOARD_LOG_DIR) when writing the training script.
[Caution] TensorBoard metrics log storage cycle: Metrics older than 120 days are deleted automatically.
import tensorflow as tf
# Specify the TensorBoard log path
tb_log = tf.keras.callbacks.TensorBoard(log_dir=os.environ.get("EM_TENSORBOARD_LOG_DIR"))
model = ... # model implementation
model.fit(x_train, y_train, validation_data=(x_test, y_test),
epochs=100, batch_size=20, callbacks=[tb_log])
TensorFlow: The TF_CONFIG environment variable required for distributed training is automatically set. For more information, please refer to the TensorFlow guide document.
PyTorch: Backend settings are required for distributed training. If distributed training is performed on CPU, set the backend to gloo; if it is performed on GPU, set it to nccl. For more information, please refer to the PyTorch guide document.
The AI EasyMaker service periodically upgrades the cluster version to provide stable service and new features. When a new cluster version is deployed, you need to move the notebooks and endpoints running on the old cluster version to the new cluster. The following explains how to move to the new cluster for each resource.
On the Notebook list screen, notebooks that need to be moved to the new cluster display a Restart button to the left of their name. Hovering the mouse pointer over the Restart button displays restart instructions and an expiration date.
Restarts take about 25 minutes for the first run, and about 10 minutes for subsequent runs. Failed restarts are automatically reported to the administrator.
On the endpoints list screen, endpoints that need to be moved to the new cluster will have a ! Notice to the left of the name. If you hover over the ! Notice, it displays a version upgrade announcement and an expiration date. Before the expiration, you must follow these instructions to move stages running on the old version cluster to the new version cluster.
[Caution] Deleting a stage will shut down the endpoint, preventing API calls. Ensure that the stage is not in service before deleting it.
The default stage is the stage on which the actual service operates. To move the cluster version of the default stage without disrupting the service, use the following guide to move it.
exit code : -9 (pid: {pid})
When you create batch inferences and endpoints in AI EasyMaker, resources are allocated on the selected instance type, minus the default system usage. The amount of resources you need depends on the demand and complexity of your model, so carefully set the number of pods and the resource quota along with an appropriate instance type.
Batch inference divides the actual available resources by the number of pods and allocates the result to each pod. For endpoints, the quota you enter cannot exceed the actual available resources of the instance, so check your resource usage beforehand. Note that both batch inference and endpoint creation can fail if the allocated resources are less than the minimum usage required for inference.
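As a rough sketch of the batch inference arithmetic, with hypothetical numbers:

# Hypothetical instance: 16 vCPUs, with 0.5 vCPU reserved as default usage.
instance_cpu = 16.0
default_usage = 0.5
pod_count = 4

# Batch inference: the available resources are divided across the pods.
cpu_per_pod = (instance_cpu - default_usage) / pod_count
print(cpu_per_pod)  # 3.875 vCPUs allocated to each pod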
The AI EasyMaker service provides endpoints based on the open inference protocol (OIP) specification. For more information about the OIP specification, see OIP Specification.
Name | Method | API path |
---|---|---|
Model List | GET | /{model_name}/v1/models |
Model Ready | GET | /{model_name}/v1/models/{model_name} |
Inference | POST | /{model_name}/v1/models/{model_name}/predict |
Explanation | POST | /{model_name}/v1/models/{model_name}/explain |
Server information | GET | /{model_name}/v2 |
Server Live | GET | /{model_name}/v2/health/live |
Server Ready | GET | /{model_name}/v2/health/ready |
Model Information | GET | /{model_name}/v2/models/{model_name} |
Model Ready | GET | /{model_name}/v2/models/{model_name}/ready |
Inference | POST | /{model_name}/v2/models/{model_name}/infer |
OpenAI generative model inference | POST | /{model_name}/openai/v1/completions |
OpenAI generative model inference | POST | /{model_name}/openai/v1/chat/completions |
[Note] OpenAI generative model inference: OpenAI generative model inference is used when serving a generative model, such as OpenAI's GPT-4o. The inputs required for inference must be entered according to OpenAI's API specification. For more information, see the OpenAI API documentation. For models that support the Completion and Chat Completion APIs provided by AI EasyMaker, see Model endpoint compatibility.
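For illustration, an inference request against the v2 path above can be sketched as follows; the model name, tensor name, shape, and values are hypothetical and depend on the model being served.

import requests

# Hypothetical deployed model name; {endpoint-domain} is a placeholder.
model_name = "my-model"
url = f"https://{{endpoint-domain}}/{model_name}/v2/models/{model_name}/infer"

body = {
    "inputs": [
        {
            "name": "input-0",  # tensor name, model-dependent
            "shape": [2, 4],
            "datatype": "FP32",
            "data": [[6.8, 2.8, 4.8, 1.4], [6.0, 3.4, 4.5, 1.6]],
        }
    ]
}

response = requests.post(url, json=body, timeout=30)
print(response.json())  # response follows the OIP specification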
The TensorFlow model serving provided by AI EasyMaker uses the SavedModel (.pb) recommended by TensorFlow. To use checkpoints, save the checkpoint variables directory saved as a SavedModel along with the model directory, which will be used to serve the model. Reference: https://www.tensorflow.org/guide/saved_model
AI EasyMaker serves PyTorch models (.mar) with TorchServe. We recommend using MAR files created with model-archiver. Weight files can also be served, but certain files are required along with them. See the table below and the model-archiver documentation for the required files and detailed descriptions.
File name | Necessity | Description |
---|---|---|
model.py | Required | The model structure file passed in the model-file parameter. |
handler.py | Required | The file passed to the handler parameter to handle the inference logic. |
weight files (.pt, .pth, .bin) | Required | The file that stores the weights and structure of the model. |
requirements.txt | Optional | Files for installing Python packages needed when serving. |
extra/ | Optional | The files in the directory are passed in the extra-files parameter. |
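As a rough sketch of what handler.py can look like, a custom handler may extend TorchServe's BaseHandler; the input format assumed here (a JSON list of feature rows) is hypothetical, and this is not a drop-in implementation.

# handler.py -- minimal sketch of a TorchServe custom handler
import torch
from ts.torch_handler.base_handler import BaseHandler

class ExampleHandler(BaseHandler):
    def preprocess(self, data):
        # Each request carries its payload under "data" or "body".
        rows = [req.get("data") or req.get("body") for req in data]
        return torch.as_tensor(rows, dtype=torch.float32)

    def inference(self, inputs):
        # self.model is loaded by BaseHandler.initialize from the MAR file.
        with torch.no_grad():
            return self.model(inputs)

    def postprocess(self, outputs):
        # TorchServe expects one result per request in the batch.
        return outputs.tolist()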
The Hugging Face model can be served using the Runtime provided by AI EasyMaker, TensorFlow Serving, or TorchServe.
This is a simple way to serve Hugging Face models. Hugging Face Runtime serving does not support fine-tuning. To serve fine-tuned models, use the TensorFlow/PyTorch serving method.
[Note] Supported Hugging Face Tasks: Currently, the Hugging Face Runtime does not support the full range of tasks in Hugging Face. The following tasks are supported: sequence_classification, token_classification, fill_mask, text_generation, and text2text_generation. To use unsupported tasks, use the TensorFlow/PyTorch serving method.
[Note] Gated Model: To serve a gated model, you must enter the token of an account that is allowed access as a model parameter. If you do not enter a token, or if you enter a token from an account that is not allowed, the model deployment fails.
How to serve a Hugging Face model trained with TensorFlow and PyTorch.
Download the Hugging Face model.
You can download it using the AutoTokenizer and AutoModel classes from the transformers library, as shown in the example code below.
from transformers import AutoTokenizer, AutoModel

model_id = "<model_id>"
revision = "main"
model_dir = f"./models/{model_id}/{revision}"

tokenizer = AutoTokenizer.from_pretrained(model_id, revision=revision)
# from_pretrained downloads the pretrained weights
# (from_config would create a randomly initialized model)
model = AutoModel.from_pretrained(model_id, revision=revision)

tokenizer.save_pretrained(model_dir)
model.save_pretrained(model_dir)
If the model fails to download with AutoModel, import the class appropriate for your model instead and try downloading again.
View the Hugging Face model information and generate the files needed to serve it.