
    Learnings from a Machine Learning Engineer — Part 5: The Training

    By FinanceStarGate | February 13, 2025 | 16 Mins Read

    In this fifth part of my series, I’ll outline the steps for creating a Docker container for training your image classification model, evaluating performance, and preparing for deployment.

    AI/ML engineers would prefer to focus on model training and data engineering, but the reality is that we also need to understand the infrastructure and mechanics behind the scenes.

    I hope to share some tips, not only for getting your training run going, but also for streamlining the process in a cost-efficient manner on cloud resources such as Kubernetes.

    I’ll reference elements from my previous articles for getting the best model performance, so be sure to check out Part 1 and Part 2 on the data sets, as well as Part 3 and Part 4 on model evaluation.

    Here are the learnings that I’ll share with you, once we lay the groundwork on the infrastructure:

    • Building your Docker container
    • Executing your training run
    • Deploying your model

    Infrastructure overview

    First, let me provide a brief description of the setup that I created, specifically around Kubernetes. Your setup may be entirely different, and that’s just fine. I simply want to set the stage on the infrastructure so that the rest of the discussion makes sense.

    Image management system

    This is a server you deploy that provides a user interface for your subject matter experts to label and evaluate images for the image classification application. The server can run as a pod in your Kubernetes cluster, but you may find that a dedicated server with faster disk works better.

    Image files are stored in a directory structure like the following, which is self-documenting and easily modified.

    Image_Library/
      - cats/
        - image1001.png
      - dogs/
        - image2001.png
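
Because the folder names double as class labels, a script can build the label list straight from the directory tree. Here is a minimal sketch of that idea; the function name `index_image_library` and the set of accepted file extensions are my own assumptions for illustration, not something from the image management system itself.

```python
from pathlib import Path

# Assumed set of image types; adjust to whatever your library contains
IMAGE_EXTENSIONS = {".png", ".jpg", ".jpeg"}


def index_image_library(root):
    """Map each class label (folder name) to a sorted list of its image paths."""
    library = {}
    for class_dir in sorted(Path(root).iterdir()):
        if not class_dir.is_dir():
            continue  # ignore stray files at the top level
        library[class_dir.name] = sorted(
            p for p in class_dir.iterdir()
            if p.is_file() and p.suffix.lower() in IMAGE_EXTENSIONS
        )
    return library
```

Calling `index_image_library("Image_Library")` on the tree above would return a dictionary with the keys `cats` and `dogs`.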

    Ideally, these files would reside on local server storage (instead of cloud or cluster storage) for better performance. The reason for this will become clear as we see what happens as the image library grows.

    Cloud storage

    Cloud storage offers a virtually limitless and convenient way to share files between systems. In this case, the image library on your management system could access the same files as your Kubernetes cluster or Docker engine.

    However, the downside of cloud storage is the latency to open a file. Your image library could have thousands and thousands of images, and the latency to read each file will have a significant impact on your training run time. Longer training runs mean more cost for using the expensive GPU processors!

    The way that I found to speed things up is to create a tar file of your image library on your management system and copy it to cloud storage. Even better would be to create multiple tar files in parallel, each containing 10,000 to 20,000 images.

    This way you only have network latency on a handful of files (which contain thousands of images, once extracted) and you start your training run much sooner.

    Kubernetes or Docker engine

    A Kubernetes cluster, with proper configuration, will allow you to dynamically scale nodes up and down, so you can perform your model training on GPU hardware as needed. Kubernetes is a rather heavy setup, and there are other container engines that will work.

    The technology options change constantly!

    The main idea is that you want to spin up the resources you need, for only as long as you need them, then scale down to reduce the time (and therefore cost) of running expensive GPU resources.

    Once your GPU node is started and your Docker container is running, you can extract the tar files above to local storage, such as an emptyDir, on your node. The node typically has high-speed SSD disk, ideal for this type of workload. There is one caveat: the storage capacity on your node must be able to handle your image library.
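
One way to guard against that caveat is to compare the size of the tar files with the free space on the target volume before extracting anything. The sketch below is only a rough check under my own assumptions: the tar files are uncompressed (so extracted size is close to archive size), and the 20% headroom factor is an arbitrary choice. The function name `library_fits` is hypothetical.

```python
import shutil
from pathlib import Path


def library_fits(tar_paths, target_dir, headroom=1.2):
    """Return True if target_dir has room for the tar contents plus headroom.

    Assumes uncompressed tar files, so the extracted size is roughly the
    archive size. The headroom factor (an arbitrary 20% by default) leaves
    space for filesystem overhead and checkpoints.
    """
    needed = sum(Path(p).stat().st_size for p in tar_paths) * headroom
    free = shutil.disk_usage(target_dir).free
    return needed <= free
```

Keep in mind that the free space reported here is that of the underlying filesystem, which may not match a size limit configured on the volume, so treat the result as an early warning rather than a guarantee.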

    Assuming we’re good, let’s talk about building your Docker container so that you can train your model on your image library.

    Building your Docker container

    Being able to execute a training run in a consistent manner lends itself perfectly to building a Docker container. You can “pin” the version of libraries so you know exactly how your scripts will run every time. You can version control your containers as well, and revert to a known good image in a pinch. What’s really nice about Docker is that you can run the container pretty much anywhere.

    The tradeoff when running in a container, especially with an image classification model, is the speed of file storage. You can attach any number of volumes to your container, but they are usually network attached, so there is latency on each file read. This may not be a problem if you have a small number of files. But when dealing with hundreds of thousands of files like image data, that latency adds up!

    This is why using the tar file method outlined above can be beneficial.

    Also, keep in mind that Docker containers could be terminated unexpectedly, so you should make sure to store important information outside the container, on cloud storage or in a database. I’ll show you how below.

    Dockerfile

    Knowing that you will need to run on GPU hardware (here I’ll assume Nvidia), be sure to select the right base image for your Dockerfile, such as nvidia/cuda with the “devel” flavor, which contains the CUDA toolkit and libraries you need (the GPU driver itself comes from the host node).

    Next, you will add the script files to your container, along with a “batch” script to coordinate the execution. Here is an example Dockerfile, and then I’ll describe what each of the scripts will be doing.

    #####   Dockerfile   #####
    FROM nvidia/cuda:12.8.0-devel-ubuntu24.04

    # Install system software
    RUN apt-get -y update && apt-get -y upgrade
    RUN apt-get install -y python3-pip python3-dev

    # Setup python
    # Note: Ubuntu 24.04 marks the system Python as externally managed (PEP 668),
    # so these pip commands may need a virtual environment to succeed.
    WORKDIR /app
    COPY requirements.txt .
    RUN python3 -m pip install --upgrade pip
    RUN python3 -m pip install -r requirements.txt

    # Python and batch scripts
    COPY ExtractImageLibrary.py .
    COPY Training.py .
    COPY Evaluation.py .
    COPY ScorePerformance.py .
    COPY ExportModel.py .
    COPY BulkIdentification.py .
    COPY BatchControl.sh .

    # Allow for interactive shell
    CMD tail -f /dev/null

    Dockerfiles are declarative, almost like a cookbook for building a small server: you know what you’ll get every time. Python libraries benefit, too, from this declarative approach. Here is a sample requirements.txt file that loads the TensorFlow libraries with CUDA support for GPU acceleration.

    #####   requirements.txt   #####
    numpy==1.26.3
    pandas==2.1.4
    scipy==1.11.4
    keras==2.15.0
    tensorflow[and-cuda]

    Extract Image Library script

    In Kubernetes, the Docker container can access local, high-speed storage on the physical node. This can be achieved via the emptyDir volume type. As mentioned before, this will only work if the local storage on your node can handle the size of your library.

    #####   sample 25GB emptyDir volume in Kubernetes   #####
    containers:
      - name: training-container
        volumeMounts:
          - name: image-library
            mountPath: /mnt/image-library
    volumes:
      - name: image-library
        emptyDir:
          sizeLimit: 25Gi

    You will want another volumeMount for your cloud storage, where you have the tar files. What this looks like will depend on your provider, or on whether you are using a persistent volume claim, so I won’t go into detail here.

    Now you can extract the tar files, ideally in parallel for an added performance boost, to the local mount point.
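
To make the parallel extraction concrete, here is a minimal sketch using a thread pool, one task per tar file. This is not the actual ExtractImageLibrary.py; the function names, the `*.tar` naming pattern, and the worker count are assumptions for illustration. The `filter="data"` argument is only passed when the running Python version supports extraction filters.

```python
import tarfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def extract_one(tar_path, dest_dir):
    """Extract a single tar file into dest_dir and return its path."""
    with tarfile.open(tar_path) as archive:
        if hasattr(tarfile, "data_filter"):
            # Safer extraction on Python versions that support filters
            archive.extractall(dest_dir, filter="data")
        else:
            archive.extractall(dest_dir)
    return str(tar_path)


def extract_all(tar_dir, dest_dir, workers=4):
    """Extract every *.tar file found in tar_dir into dest_dir in parallel."""
    tar_paths = sorted(Path(tar_dir).glob("*.tar"))
    Path(dest_dir).mkdir(parents=True, exist_ok=True)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda p: extract_one(p, dest_dir), tar_paths))
```

With the emptyDir sample above, the destination would be `/mnt/image-library`, and the source would be wherever your cloud storage is mounted. Threads are a reasonable fit because the work is mostly disk and network I/O, but separate processes would also work.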

    Training script

    As AI/ML engineers, the model training is where we want to spend most of our time.

    This is where the magic happens!

    With your image library now extracted, we can create our train-validation-test sets, load a pre-trained model or build a new one, fit the model, and save the results.
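
As an illustration of the first step, here is one simple way to split the extracted library into train, validation, and test sets, class by class. It is a sketch under my own assumptions (10% validation, 10% test, a fixed seed, and a `{label: [paths]}` dictionary as input), not the exact split logic from my earlier articles.

```python
import random


def split_by_class(library, val_fraction=0.1, test_fraction=0.1, seed=42):
    """Split {label: [paths]} into lists of (path, label) per data set.

    Each class is shuffled and split separately, so every class appears
    in all three sets in roughly the same proportions.
    """
    rng = random.Random(seed)  # fixed seed keeps the split repeatable
    splits = {"train": [], "validation": [], "test": []}
    for label in sorted(library):
        paths = sorted(library[label])
        rng.shuffle(paths)
        n_val = int(len(paths) * val_fraction)
        n_test = int(len(paths) * test_fraction)
        splits["validation"] += [(p, label) for p in paths[:n_val]]
        splits["test"] += [(p, label) for p in paths[n_val:n_val + n_test]]
        splits["train"] += [(p, label) for p in paths[n_val + n_test:]]
    return splits
```

Splitting per class keeps small classes from disappearing from the validation or test set, which matters when the library is unbalanced.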

    One key technique that has served me well is to load the most recently trained model as my base. I discuss this in more detail in Part 4 under “Fine tuning”; it results in faster training time and significantly improved model performance.

    Be sure to take advantage of the local storage to checkpoint your model during training, since the models are quite large and you are paying for the GPU even while it sits idle writing to disk.

    This of course raises a concern about what happens if the Docker container dies part-way through the training. The risk is (hopefully) low with a cloud provider, and you may not want an incomplete training run anyway. But if that does happen, you will at least want to understand why, and this is where saving the main log file to cloud storage (described below) or to a package like MLflow comes in handy.

    Evaluation script

    After your training run has completed and you have taken proper precautions to save your work, it is time to see how well it performed.

    Normally this evaluation script will pick up the model that just finished. But you may decide to point it at a previous model version through an interactive session. This is why I keep the script stand-alone.

    Being a separate script, it will need to read the completed model from disk, ideally local disk for speed. I like having two separate scripts (training and evaluation), but you might find it better to combine these to avoid reloading the model.

    Now that the model is loaded, the evaluation script should generate predictions on every image in the training, validation, test, and benchmark sets. I save the results as a huge matrix with the softmax confidence score for each class label. So, if there are 1,000 classes and 100,000 images, that’s a table with 100 million scores!

    I save these results in pickle files that are then used in the score generation next.
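
A minimal sketch of that hand-off between scripts might look like the following. The layout of the saved object (image names, class labels, and one row of scores per image) and the function names are assumptions for illustration; in practice the scores would more likely be a NumPy array than nested lists.

```python
import pickle


def save_scores(path, image_names, class_labels, score_rows):
    """Write the softmax scores for one data set to a pickle file.

    score_rows[i][j] is the confidence that image_names[i] belongs
    to class_labels[j].
    """
    payload = {
        "images": image_names,
        "classes": class_labels,
        "scores": score_rows,
    }
    with open(path, "wb") as handle:
        pickle.dump(payload, handle, protocol=pickle.HIGHEST_PROTOCOL)


def load_scores(path):
    """Read a score file written by save_scores (only load trusted files)."""
    with open(path, "rb") as handle:
        return pickle.load(handle)
```

Writing one file per data set (train, validation, test, benchmark) keeps each file a manageable size and lets the scoring script be rerun on a single set.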

    Score generation script

    Taking the matrix of scores produced by the evaluation script above, we can now create various metrics of model performance. Again, this process could be combined with the evaluation script above, but my preference is for independent scripts. For example, I might want to regenerate scores on previous training runs. See what works for you.

    Here are some of the sklearn functions that produce useful insights like F1, log loss, AUC-ROC, and the Matthews correlation coefficient.

    from sklearn.metrics import average_precision_score, classification_report
    from sklearn.metrics import log_loss, matthews_corrcoef, roc_auc_score

    Apart from these basic statistical analyses for each dataset (train, validation, test, and benchmark), it is also useful to identify:

    • Which ground truth labels get the most errors?
    • Which predicted labels get the most incorrect guesses?
    • How many ground-truth-to-predicted label pairs are there? In other words, which classes are easily confused?
    • What is the accuracy when applying a minimum softmax confidence score threshold?
    • What is the error rate above that softmax threshold?
    • For the “difficult” benchmark sets, do you get a sufficiently high score?
    • For the “out-of-scope” benchmark sets, do you get a sufficiently low score?
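
Several of the questions above can be answered with a few counters over the score matrix. The sketch below is one possible way to do it in plain Python; the function names, the returned dictionary keys, and the default threshold of 0.9 are my own assumptions for illustration.

```python
from collections import Counter


def top_prediction(scores, classes):
    """Return the (label, confidence) pair with the highest softmax score."""
    best = max(range(len(scores)), key=scores.__getitem__)
    return classes[best], scores[best]


def confusion_summary(truth, score_rows, classes, threshold=0.9):
    """Count errors per label and measure accuracy above a confidence threshold."""
    errors_by_truth = Counter()
    errors_by_predicted = Counter()
    confused_pairs = Counter()
    kept = 0          # predictions at or above the threshold
    kept_correct = 0  # ... of which are correct
    for true_label, scores in zip(truth, score_rows):
        predicted, confidence = top_prediction(scores, classes)
        if predicted != true_label:
            errors_by_truth[true_label] += 1
            errors_by_predicted[predicted] += 1
            confused_pairs[(true_label, predicted)] += 1
        if confidence >= threshold:
            kept += 1
            if predicted == true_label:
                kept_correct += 1
    accuracy = kept_correct / kept if kept else None
    return {
        "errors_by_truth": errors_by_truth,
        "errors_by_predicted": errors_by_predicted,
        "confused_pairs": confused_pairs,
        "kept": kept,
        "accuracy_above_threshold": accuracy,
        "error_rate_above_threshold": None if accuracy is None else 1 - accuracy,
    }
```

Calling `most_common()` on any of the returned counters gives a ranked list, for example the class pairs that are confused most often.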

    As you can see, there are multiple calculations and it’s not easy to come up with a single evaluation to decide if the trained model is good enough to be moved to production.

    In fact, for an image classification model, it is helpful to manually review the images that the model got wrong, as well as the ones that got a low softmax confidence score. Use the scores from this script to create a list of images to manually review, and then get a gut feel for how well the model performs.

    Check out Part 3 for a more in-depth discussion on evaluation and scoring.

    Export script

    All of the heavy lifting is done by this point. Since your Docker container will be shut down soon, now is the time to copy the model artifacts to cloud storage and prepare them for being put to use.

    The example Python code snippet below is geared to Keras and TensorFlow. It takes the trained model and exports it as a saved_model. Later, I’ll show how this is used by TensorFlow Serving in the Deploy section below.

    import os
    import tensorflow as tf

    # create_new_version_folder(), copy_model_artifacts() and keras_model_file
    # are defined elsewhere in the export script (not shown here)

    # Increment current version of model and create new directory
    next_version_dir, version_number = create_new_version_folder()

    # Copy model artifacts to the new directory
    copy_model_artifacts(next_version_dir)

    # Create the directory to save the model export
    saved_model_dir = os.path.join(next_version_dir, str(version_number))

    # Save the model export for use with TensorFlow Serving
    # (set_learning_phase is deprecated in newer TensorFlow releases)
    tf.keras.backend.set_learning_phase(0)
    model = tf.keras.models.load_model(keras_model_file)
    tf.saved_model.save(model, export_dir=saved_model_dir)

    This script also copies the other training run artifacts such as the model evaluation results, score summaries, and log files generated from model training. Don’t forget about your label map so that you can give human-readable names to your classes!
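
The snippet above relies on a helper that picks the next version number. As a rough idea of what such a helper could look like, here is a hypothetical variant that takes the models folder as an argument (the one in my snippet takes none) and uses zero-padded, three-digit folder names such as 007. Both choices are assumptions for the sake of the example.

```python
from pathlib import Path


def create_new_version_folder(models_root, width=3):
    """Create the next numbered version folder under models_root.

    Returns (path_to_new_folder, version_number). Folders whose names
    are not purely numeric are ignored when finding the latest version.
    """
    root = Path(models_root)
    root.mkdir(parents=True, exist_ok=True)
    existing = [
        int(p.name) for p in root.iterdir()
        if p.is_dir() and p.name.isdigit()
    ]
    version_number = max(existing, default=0) + 1
    next_version_dir = root / f"{version_number:0{width}d}"
    next_version_dir.mkdir()
    return next_version_dir, version_number
```

If the folder already holds versions up to 006, the call returns the new 007 folder together with the integer 7.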

    Bulk identification script

    Your training run is complete, your model has been scored, and a new version is exported and ready to be served. Now is the time to use this latest model to help you identify unlabeled images.

    As I described in Part 4, you may have a collection of “unknowns”: really good pictures, but no idea what they are. Let your new model provide a best guess on these and record the results to a file or a database. Now you can create filters based on closest match and on high or low scores. Your subject matter experts can then use these filters to find new image classes, add to existing classes, or remove images that have very low scores and are no good.
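
To show what recording and filtering the best guesses could look like, here is a small sketch that writes them to a CSV file and reads them back with a label and score filter. The file layout, column names, and function names are assumptions for illustration; a database table would work just as well.

```python
import csv


def write_best_guesses(csv_path, image_names, score_rows, classes):
    """Record the top class and its softmax score for each unknown image."""
    with open(csv_path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["image", "best_guess", "score"])
        for name, scores in zip(image_names, score_rows):
            best = max(range(len(scores)), key=scores.__getitem__)
            writer.writerow([name, classes[best], f"{scores[best]:.4f}"])


def filter_guesses(csv_path, label=None, min_score=0.0, max_score=1.0):
    """Return recorded rows matching an optional label and a score range."""
    with open(csv_path, newline="") as handle:
        rows = list(csv.DictReader(handle))
    return [
        row for row in rows
        if (label is None or row["best_guess"] == label)
        and min_score <= float(row["score"]) <= max_score
    ]
```

A high `min_score` for one label surfaces likely additions to an existing class, while a low `max_score` across all labels surfaces candidates for new classes or for removal.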

    By the way, I put this step inside the GPU container since you may have thousands of “unknown” images to process and the accelerated hardware will make light work of it. However, if you are not in a hurry, you could perform this step on a separate CPU node, and shut down your GPU node sooner to save cost. This would especially make sense if your “unknowns” folder is on slower cloud storage.

    Batch script

    Each of the scripts described above performs a specific task: extracting your image library, executing model training, performing evaluation and scoring, exporting the model artifacts for deployment, and perhaps even bulk identification.

    One script to rule them all

    To coordinate the entire show, this batch script gives you the entry point for your container and an easy way to trigger everything. Be sure to produce a log file in case you need to analyze any failures along the way. Also, be sure to write the log to your cloud storage in case the container dies unexpectedly.

    #!/bin/bash
    # Main batch control script

    # Redirect standard output and standard error to a log file
    exec > /cloud_storage/batch-logfile.txt 2>&1

    # Run each step with python3 so the scripts do not need to be executable
    python3 /app/ExtractImageLibrary.py
    python3 /app/Training.py
    python3 /app/Evaluation.py
    python3 /app/ScorePerformance.py
    python3 /app/ExportModel.py
    python3 /app/BulkIdentification.py

    Executing your training run

    So, now it is time to put everything in motion…

    Start your engines!

    Let’s go through the steps to prepare your image library, fire up your Docker container to train your model, and then examine the results.

    Image library ‘tar’ files

    Your image management system should now create a tar file backup of your data. Since tar is single-threaded, you will get a significant speed improvement by creating multiple tar files in parallel, each with a portion of your data.

    Now these files can be copied to your shared cloud storage for the next step.
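
Here is one way the parallel tar creation could be sketched in Python. The batch size default follows the 10,000-image figure mentioned earlier; the function names, the `images_0000.tar` naming scheme, and the use of a thread pool are assumptions for illustration. The output folder is assumed to sit outside the image library so that the tar files do not end up inside themselves.

```python
import tarfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def write_tar(tar_path, library_root, files):
    """Write one uncompressed tar, storing paths relative to the library root."""
    with tarfile.open(tar_path, "w") as archive:
        for file_path in files:
            arcname = file_path.relative_to(library_root).as_posix()
            archive.add(file_path, arcname=arcname)
    return tar_path


def create_tar_batches(library_root, out_dir, images_per_tar=10_000, workers=4):
    """Split the library into batches and write one tar per batch in parallel."""
    root = Path(library_root)
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    files = sorted(p for p in root.rglob("*") if p.is_file())
    batches = [
        files[i:i + images_per_tar]
        for i in range(0, len(files), images_per_tar)
    ]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [
            pool.submit(write_tar, out / f"images_{i:04d}.tar", root, batch)
            for i, batch in enumerate(batches)
        ]
        return [future.result() for future in futures]
```

Because each archive stores paths such as `cats/image1001.png`, extracting all of them into the same folder rebuilds the original class directory structure.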

    Start Docker container

    All the hard work you put into creating your container (described above) will be put to the test. If you are running Kubernetes, you can create a Job that will execute the BatchControl.sh script.

    Inside the Kubernetes Job definition, you can pass environment variables to control the execution of your script. For example, the batch size and number of epochs are set here and then pulled into your Python scripts, so you can alter the behavior without changing your code.

    #####   sample Job in Kubernetes   #####
    containers:
      - name: training-job
        env:
          - name: BATCH_SIZE
            value: "50"
          - name: NUM_EPOCHS
            value: "30"
        command: ["/app/BatchControl.sh"]
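
On the Python side, the BATCH_SIZE and NUM_EPOCHS variables from the Job definition can be read with a small helper like the one below. The helper name and the fallback defaults are my own assumptions; environment variables always arrive as strings, so they need to be converted before use.

```python
import os


def read_int_env(name, default, environ=None):
    """Read an integer setting from the environment, with a fallback default.

    The optional environ argument exists only to make the function easy
    to test; by default it reads from os.environ.
    """
    env = os.environ if environ is None else environ
    raw = env.get(name)
    if raw is None or raw.strip() == "":
        return default
    return int(raw)


# Defaults here are placeholders used when the Job does not set a value
batch_size = read_int_env("BATCH_SIZE", default=32)
num_epochs = read_int_env("NUM_EPOCHS", default=10)
```

Having defaults also means the same script still runs in an interactive session where the Job variables are not set.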

    Once the Job is completed, be sure to verify that the GPU node properly scales back down to zero according to your scaling configuration in Kubernetes; you don’t want to be saddled with a huge bill over a simple configuration error.

    Manually review results

    With the training run complete, you should now have model artifacts saved and can examine the performance. Look through the metrics, such as F1 and log loss, and benchmark accuracy for high softmax confidence scores.

    As mentioned earlier, the reports only tell part of the story. It is worth the time and effort to manually review the images that the model got wrong or where it produced a low confidence score.

    Don’t forget about the bulk identification. Be sure to use those results to locate new images to fill out your data set, or to find new classes.

    Deploying your model

    Once you have reviewed your model performance and are satisfied with the results, it is time to modify your TensorFlow Serving container to put the new model into production.

    TensorFlow Serving is available as a Docker container and provides a very quick and convenient way to serve your model. This container can listen and respond to API calls for your model.

    Let’s say your new model is version 7, and your Export script (see above) has saved the model in your cloud share as /image_application/models/007. You can start the TensorFlow Serving container with that volume mount. In this example, the shareName points to the folder for version 007.

    #####   sample TensorFlow pod in Kubernetes   #####
    containers:
      - name: tensorflow-serving
        image: bitnami/tensorflow-serving:2.18.0
        ports:
          - containerPort: 8501
        env:
          - name: TENSORFLOW_SERVING_MODEL_NAME
            value: "image_application"
        volumeMounts:
          - name: models-subfolder
            mountPath: "/bitnami/model-data"

    volumes:
      - name: models-subfolder
        azureFile:
          shareName: "image_application/models/007"

    A subtle note here: the export script should create a sub-folder, named 007 (same as the base folder), with the saved model export. This may seem a little confusing, but TensorFlow Serving will mount this share folder as /bitnami/model-data and detect the numbered sub-folder inside it as the version to serve. This allows you to query the API for the model version as well as the identification.
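
For reference, here is a minimal sketch of calling the TensorFlow Serving REST API on port 8501 from Python using only the standard library. The host name passed in and the shape of `instances` are assumptions that depend on your service and your model’s input signature; the function names are hypothetical.

```python
import json
import urllib.request


def build_predict_request(host, model_name, instances, version=None, port=8501):
    """Build a POST request for the TensorFlow Serving REST predict endpoint.

    If version is given, the request targets that specific model version;
    otherwise the server picks the version it is currently serving.
    """
    version_part = f"/versions/{version}" if version is not None else ""
    url = f"http://{host}:{port}/v1/models/{model_name}{version_part}:predict"
    body = json.dumps({"instances": instances}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def predict(request, timeout=30):
    """Send the request and return the list of predictions from the response."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read())["predictions"]
```

With the pod above, a request for the model named image_application and version 7 goes to `/v1/models/image_application/versions/7:predict`, and each returned prediction is the row of softmax scores for one input image.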

    Conclusion

    As I mentioned at the start of this article, this setup has worked for my situation. This is certainly not the only way to approach this challenge, and I invite you to customize your own solution.

    I wanted to share my hard-fought learnings as I embraced cloud services in Kubernetes, with the desire to keep costs under control. Of course, doing all this while maintaining a high level of model performance is an added challenge, but one that you can achieve.

    I hope I have provided enough information here to help you with your own endeavors. Happy learnings!


