How to Implement AI Self Checkout in Retail if You Are Not Amazon

MobiDev

- Last Updated: December 2, 2024

Online retail has one key advantage: customer experience. There are no queues, no delays, and very little effort involved in making a purchase. Yet according to research from Forrester, 72 percent of U.S. retail sales will still occur in brick-and-mortar stores, because people want to interact with a product before buying or simply don’t want to wait for delivery. AI retail checkout may be a way to bring the convenience of online shopping to physical stores.

The idea of checkout-free shopping in physical venues has crystallized in stores run by Amazon (Amazon Go), Tesco, Walmart, and many more. Fully automated checkout with computer vision is a successful example of retail automation. But few store owners want to build a whole new outlet to run their business offline, as it requires an integrated software infrastructure and imposes the development and financial challenges we will discuss today.

In this article, we’ll analyze how any brick and mortar store can be automated with computer vision systems. Here we’ll look at how it works, what the options for checkout automation are, and what the challenges are.

Computer Vision Checkout Automation for Brick and Mortar Retail

The majority of in-store operations like shelf management, checkout, or product weighing require human supervision. Human productivity is effectively a performance marker for the retailer, and it often becomes a bottleneck as well as a source of customer frustration.

Checkout queues in particular are a pain point for both customers and retailers. And it’s not only the queues: actual human effort costs money. So how does computer vision apply to these operations?

Computer vision (CV) is a field of artificial intelligence that enables machines to extract meaningful information from images. At its core, computer vision aims to mimic human sight. Analogous to an eye, CV relies on camera sensors that capture the environment. In turn, an underlying neural network, its brain, recognizes objects, their position in the frame, or other specific properties (such as distinguishing a Pepsi can from a Dr Pepper can).

That’s our basis for understanding how computer vision fits brick and mortar retail tasks: it can recognize products in the frame. These products can be placed on shelves or carried by customers, which allows us to do away with barcode scanning, cash register operation, and self-checkout machines.

Although implementations of computer vision differ significantly in complexity and budget, there are two common scenarios of how it can be used for retail automation. First, let’s look at how full store automation can be built.

AI-Powered Autonomous Checkout: Full Store Automation

Autonomous checkout goes by different names: “cashierless”, “grab-and-go”, “checkout-free”, etc. In the shopping experience of Amazon, Tesco, and even Walmart, such stores register the products while you shop and charge for them when you walk out. It sounds simple, and this is essentially how it works:

Shopping session start. Shops like Amazon use turnstiles to initiate shopping via scanning a QR code. At this point, the system matches the Amazon profile and digital wallet with the actual person entering the store.

Person detection. This is the recognition and tracking of people and objects done via computer vision cameras. The system remembers who each person is, and once that person takes a product from the shelf, it places the product into a virtual shopping cart. Some shops use hundreds of cameras to view from different angles and cover all of the store zones.

Product recognition. Once the person grabs something from the shelf and takes it with them, cameras capture this action. Matching the product image on video with the actual product in the retailer’s database, the store places an item into a virtual shopping cart.

Checkout. Once the product list is complete, the person can just walk out. When the person leaves the zone covered by cameras, the system treats this as the end of the shopping session. This triggers it to calculate the total and charge it to the customer’s digital wallet.
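
The four steps above can be sketched as a small session object. This is only an illustration under stated assumptions: the `ShoppingSession` class, the `PRICES` catalog, and the SKU names are hypothetical, and in a real store the `take` and `put_back` events would come from the camera pipeline rather than from direct calls.

```python
from collections import Counter
from dataclasses import dataclass, field

# Hypothetical price catalog in cents; a real store would query its product database.
PRICES = {"cola_can": 150, "chips": 275}


@dataclass
class ShoppingSession:
    """Virtual cart tied to one shopper from entry to exit."""
    customer_id: str  # resolved from the QR code scanned at entry
    cart: Counter = field(default_factory=Counter)
    closed: bool = False

    def take(self, sku: str) -> None:
        """Product recognition reported that the shopper picked up an item."""
        self.cart[sku] += 1

    def put_back(self, sku: str) -> None:
        """The shopper returned an item to the shelf."""
        if self.cart[sku] > 0:
            self.cart[sku] -= 1

    def checkout(self) -> int:
        """Shopper left the camera zone: close the session, return the total in cents."""
        self.closed = True
        return sum(PRICES[sku] * qty for sku, qty in self.cart.items())


session = ShoppingSession(customer_id="qr-1234")
session.take("cola_can")
session.take("chips")
session.take("chips")
session.put_back("chips")
total_cents = session.checkout()
print(total_cents)
```

Keeping the cart as a counter means that putting an item back is just a decrement, which matches how shoppers actually behave at a shelf.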

From the customer’s standpoint, such a system offers a shopping experience similar to that of online stores, except that you don’t need to check out. You enter, find what you want, grab it, and leave. However, to provide customers with full autonomy and cover all the edge cases, we’ll need to solve a large number of technical problems.

The Challenges of AI-Powered Autonomous Stores

Customer behavior can be unpredictable, and we are going to automate checkout for dozens of people who examine and buy thousands of products at the same time. This imposes a number of challenges for computer vision:

Continuous Person Tracking

As the customer enters the store, the system should be able to continuously track them along their shopping route. We need to know that it’s the same person who took this or that item in different parts of the store. In a crowded store, continuous tracking might be difficult. Where face recognition is not allowed, the model has to recognize people by their appearance. So what happens if somebody takes off their coat, or carries a child on their shoulders?

To enable continuous tracking, we’ll need 100 percent camera coverage so that people can be detected as they pass from zone to zone. Since cameras are placed at different angles, the sensors also need to communicate their precise location so that we can use this data to track objects more accurately.
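
One simple way to keep an identity attached to a person between frames is to match each new detection to the existing track whose bounding box overlaps it most. The sketch below is an assumption made for illustration, not a description of how any particular store works; production trackers also rely on appearance features and multi-camera calibration. The function names and the 0.3 overlap threshold are hypothetical.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0


def update_tracks(tracks, detections, min_iou=0.3):
    """Greedily assign each detection to the best-overlapping unused track.

    tracks: dict of track_id -> last known box.
    Unmatched detections get new IDs; tracks with no detection are dropped.
    """
    updated, used = {}, set()
    next_id = max(tracks, default=0) + 1
    for box in detections:
        candidates = [(iou(prev, box), tid)
                      for tid, prev in tracks.items() if tid not in used]
        best_iou, best_id = max(candidates, default=(0.0, None))
        if best_id is not None and best_iou >= min_iou:
            used.add(best_id)
            updated[best_id] = box
        else:
            updated[next_id] = box
            next_id += 1
    return updated


tracks = {1: (0, 0, 10, 20), 2: (50, 0, 60, 20)}
detections = [(52, 0, 62, 20), (1, 0, 11, 20), (100, 0, 110, 20)]
tracks = update_tracks(tracks, detections)
print(tracks)
```

Here the two known shoppers keep their IDs after moving slightly, while the detection far from both of them is treated as a new person.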

The “Who Took What?” Problem

There is also a large variety of products, and customers’ shopping process is not linear. They move items, smell them, put them back, and go to another shelf. Especially when there are multiple people at one shelf, it becomes difficult for a model to recognize who took what, and if they actually took the product to buy.

Amazon, for example, solved this problem by implementing human pose estimation and human activity analysis. Basically, that’s another layer of artificial intelligence coupled with computer vision – it measures the position and movement of a person to predict what he or she grabs and if the product was taken to be purchased.

This solves the problem of having multiple customers at a shelf, and helps determine who took a specific product even if the camera’s view was blocked by somebody.
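
To make the idea concrete, here is a minimal sketch of attributing a pick to a person from pose keypoints. It assumes a pose estimation model has already produced wrist positions for each tracked person; the function `attribute_pick`, the pixel coordinates, and the distance threshold are all hypothetical, and a real system would be considerably more involved.

```python
import math


def attribute_pick(pick_xy, wrists_by_person, max_distance=50.0):
    """Return the ID of the person whose wrist is nearest to the pick location.

    pick_xy: (x, y) pixel position where a product left the shelf.
    wrists_by_person: dict of person_id -> list of (x, y) wrist keypoints.
    Returns None when nobody is within max_distance (hypothetical threshold).
    """
    best_id, best_dist = None, max_distance
    for person_id, wrists in wrists_by_person.items():
        for wrist in wrists:
            dist = math.dist(pick_xy, wrist)
            if dist <= best_dist:
                best_id, best_dist = person_id, dist
    return best_id


wrists = {
    "shopper_a": [(120, 200), (310, 215)],
    "shopper_b": [(420, 190), (480, 400)],
}
nearby_pick = attribute_pick((300, 210), wrists)   # close to one of shopper_a's wrists
distant_pick = attribute_pick((900, 900), wrists)  # nobody is near this location
print(nearby_pick, distant_pick)
```

Returning `None` instead of guessing lets the rest of the system flag the event for another look rather than charging the wrong customer.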

Identifying Similar Products

Concerning products, we’ll also need to deal with similar packages. Some products differ only slightly in appearance, which makes it harder for the model to capture all the details, especially if the object is partially obstructed in the frame or moving fast. We can address this issue by training the model to spot small details, and by using cameras with higher resolution and frame rate.

While autonomous checkout seems beneficial, the complexity of such a system can be prohibitive. For a tech-first company, this is not a problem. But for the usual retailer, the burden brought by artificial intelligence lowers the value of such automation. That’s why partial store automation with computer vision could be more suitable for many businesses.

Smart Vending Machines: Partial Store Automation

Vending machines can be a good answer to the problems posed by tracking a whole store. They can take the form of shelves with glass doors or regular fridges that use computer vision cameras to handle purchases. By installing a QR code scanner, we can confine the checkout procedure to a single fridge. The idea is quite simple:

Shopping session start. The session starts once a person approaches the fridge and opens it. If it’s a closed-door fridge, this can be done by scanning a QR code with a mobile app. In the case of a regular shelf, cameras can track what’s grabbed from it to initiate the session.

Creating a virtual shopping cart. When the person scans the QR code, it’s a signal for the system to create a shopping cart for this specific user.

Product recognition. The cameras might be installed inside or outside of the vending machine. The internal cameras should be able to track which products are taken or put back. External cameras might track manipulations within an open fridge, just like with a regular shelf. Both types of cameras capture the products and put them into a shopping cart.

As the person might examine multiple items and move from side to side, CV cameras can also track the person in the frame. This will help us verify that it’s a single person making a purchase, and not another one standing nearby.

Verifying products. When the product is taken, the system sends this data to compare the image of the product with the one in the database and extract the price. Additionally, we can update availability automatically in our inventory management system.

Editing product list. Once the products are taken, they are sent to the user’s shopping cart, available on their smartphone or on a tablet on the fridge. Here, the customer can modify the items and proceed to payment.

Checkout. In the case of a mobile application and QR code scanning, closing the fridge might be the trigger to complete the purchase and charge the total to a digital wallet. There might also be a POS terminal installed to allow credit card payment. At this point, the purchase is done, and the person can leave the store.
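
For a closed fridge with internal cameras, one way to build the cart is to compare what the cameras count before the door opens with what they count after it closes. This is a minimal sketch under that assumption; the function `cart_from_snapshots` and the SKU names are hypothetical, and the per-product counts would come from the recognition model.

```python
from collections import Counter


def cart_from_snapshots(before: Counter, after: Counter) -> Counter:
    """Items present before the door opened but missing after it closed."""
    # Counter subtraction keeps only positive differences, so products that
    # were put back never show up as negative cart entries.
    return before - after


before = Counter({"water_500ml": 6, "sandwich": 4, "yogurt": 3})
after = Counter({"water_500ml": 5, "sandwich": 4, "yogurt": 1})
cart = cart_from_snapshots(before, after)
print(dict(cart))
```

The same difference can be used to update availability in the inventory management system once the purchase is confirmed.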

While they may seem like a relatively weak alternative to a fully autonomous checkout system, vending machines can be scaled easily to automate the whole store. This can make a real difference in customer experience while requiring less engineering effort and a smaller budget.

The same concept of modular automation can be applied to numerous other cases. In addition to supermarkets and grocery stores, computer-vision kiosks can also be installed in foodservice venues or coffee shops.

Checkout-Free Foodservice

Restaurants, cafes, and canteens often use a buffet serving system like a sideboard with portioned dishes customers can choose from. Customers place dishes on trays then need to check out their order, which can potentially be handled by a computer vision kiosk.

A machine learning model sitting on the backend can be trained to recognize dishes and other products placed on the tray to launch the checkout process. This idea can be implemented as a checkout kiosk where a set of cameras will scan the order. The actual payment can be completed via a usual POS terminal or using a mobile application and a digital wallet.

The concept of cashier-less operations can be taken to extremes, as Starbucks has done. Using Amazon’s system, Starbucks opened a first-of-its-kind grab-and-go coffee shop. Customers can place an order via a mobile application and come for their coffee without any checkout, similar to Amazon Go. However, handling computer vision projects requires subject matter knowledge, specifically data science and machine learning expertise.

So now let’s talk a bit about what you should know to approach computer vision-based checkout automation.

How to Approach AI-Based Checkout

Let’s examine the steps it takes to create a computer vision system for automation in retail. We’ll focus on the smart fridge case, as it is the most approachable and versatile.

Gathering Requirements

First of all, we need to understand our business case in detail:

Preferred automation method. Choosing smart fridges or other types of dispenser machines might require fewer large-scale modifications to the store while maintaining a scalable approach. Full store automation will usually require changes to the venue layout and additional hardware like turnstiles, which can be a drawback for the majority of store owners.

Store size. Vending machines can be installed in basically any number, to cover all of the store’s inventory and product diversity. So the store size will determine how many vending machines you’ll need, and what the store layout will be.

Quantity of products for recognition. As with any other machine learning project, a computer vision system requires training before it can recognize anything. A single fridge might contain 20 to 50 different products, so we should consider those numbers, as they will determine how long the training phase will take.

Existing infrastructure. In most cases, physical stores don’t have enough integration between inventory management, point of sale, and accounting. Computer vision systems will require access to the store data to automate sales updates and product availability, so examining your existing infrastructure is another point to understand when considering the requirements of this project.

So, let’s say a single fridge can contain 35 items.

Data Collection

Computer vision is an artificial intelligence technology, which means we need data in order to recognize objects. The data is used for model training to identify different products in the frame, as well as identify people and what they grab.

The optimal way to collect data for object recognition is to record each product on video from different angles and under different lighting conditions. It is important to have these videos categorized by product so that the labeling (what product is in the frame) can be done automatically. As a general recommendation, the recorded data should be as close as possible to what the cameras will see with real users.
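
As a small illustration of automatic labeling, if every product’s recordings are stored in a folder named after that product, the label can be read straight from the file path. The folder layout, file names, and the helper `label_from_path` below are assumptions made for this sketch.

```python
from pathlib import PurePosixPath


def label_from_path(video_path: str) -> str:
    """Use the parent folder name as the product label."""
    return PurePosixPath(video_path).parent.name


# Hypothetical layout: recordings/<product_name>/<angle>_<lighting>.mp4
videos = [
    "recordings/pepsi_can_330ml/front_daylight.mp4",
    "recordings/pepsi_can_330ml/side_dim.mp4",
    "recordings/dr_pepper_can_330ml/front_daylight.mp4",
]
labels = {path: label_from_path(path) for path in videos}
for path, label in labels.items():
    print(label, "<-", path)
```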

Once a working model is deployed to automate checkout, the cameras will need to deliver 60 frames per second so that the system can react quickly. The higher the frame rate, the smoother the video and the more detail we can extract from it.
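
The frame rate also sets a time budget for processing: at 60 frames per second, each frame has to be handled in roughly 16.7 milliseconds for the model to keep up with the stream. The helper names and the inference times in this sketch are hypothetical; real timings would be measured on the actual hardware.

```python
def frame_budget_ms(fps: float) -> float:
    """Time available to process one frame without falling behind the stream."""
    return 1000.0 / fps


def keeps_up(inference_ms: float, fps: float) -> bool:
    """True if a model with this per-frame latency can process every frame."""
    return inference_ms <= frame_budget_ms(fps)


budget = frame_budget_ms(60)
print(round(budget, 2))       # per-frame budget in milliseconds at 60 fps
print(keeps_up(12.0, 60))     # hypothetical 12 ms model
print(keeps_up(25.0, 60))     # hypothetical 25 ms model
```

If the model cannot fit within the budget, the usual options are faster hardware, a lighter model, or processing only a subset of the frames.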

Model Training

The next step is training. Once we collect all the video recordings, a machine learning expert will prepare them for model training. This process can be split into two tasks:

Preparing data. We need to split the videos into separate images and label the products we want to detect. Put simply, we might extract 60 still images from a minute-long video and draw bounding boxes around our target objects.
Choosing an algorithm. An algorithm is a mathematical model that learns patterns from the given data to make predictions. For tasks like object recognition, there are existing working algorithms that can be applied for building a model. So our task here is to choose a suitable one, and feed it with our data.
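
The data preparation step can be sketched as follows: pick evenly spaced frames from a recording, then attach a label and a bounding box to each one. The function `sample_frame_indices`, the annotation fields, and the box coordinates are assumptions for illustration; actual frame extraction would be done with a video library, and annotation formats vary by training framework.

```python
def sample_frame_indices(fps: int, duration_s: int, images_wanted: int) -> list:
    """Indices of evenly spaced frames to export as still images."""
    total_frames = fps * duration_s
    step = max(1, total_frames // images_wanted)
    return list(range(0, total_frames, step))[:images_wanted]


# A one-minute video recorded at 60 frames per second, reduced to 60 images.
indices = sample_frame_indices(fps=60, duration_s=60, images_wanted=60)
print(len(indices), indices[:3])

# Hypothetical annotation record for one exported frame.
# bbox is (x, y, width, height) in pixels around the target product.
annotation = {
    "frame": indices[1],
    "label": "pepsi_can_330ml",
    "bbox": (412, 188, 96, 170),
}
print(annotation)
```

Sampling one frame per second avoids filling the dataset with near-identical consecutive frames.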

The process of training may take several weeks as we iterate to reach acceptable accuracy.

Model Retraining

If any products are added or swapped in the process, the model needs to be retrained. This is because prediction results depend on the data the model was trained on. This means that each time a store obtains new items for sale and places them into a computer vision fridge, we’ll need to launch a new training phase for the model to learn the new items.

For example, we’ll need retraining to recognize Pringles cans in the image if there weren’t any Pringles before. This becomes easier once cameras are installed in the fridge, because we can use live recordings to make annotations and launch training again.
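
A simple check for when retraining is due is to compare the products currently stocked in the fridge with the classes the model already knows. The function and the product names below are hypothetical.

```python
def products_needing_training(model_classes, fridge_products):
    """Products in the fridge that the current model has never been trained on."""
    return sorted(set(fridge_products) - set(model_classes))


model_classes = ["pepsi_can_330ml", "dr_pepper_can_330ml", "water_500ml"]
fridge_products = ["pepsi_can_330ml", "water_500ml", "pringles_original"]

missing = products_needing_training(model_classes, fridge_products)
if missing:
    print("Retraining needed for:", missing)
```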

Required Infrastructure

The existing infrastructure in the store is usually represented by a server that processes inventory updates and records sales volume via POS terminals. To implement a machine learning model, we’ll need to add several components:

Cameras – These are used to record and pass the visual data.
Video processing unit – This can be a video card or a single board computer like the Nvidia Jetson that includes a GPU optimized for computer vision needs.
QR scanner – This is a scanner, or a QR code sticker that the user scans with a mobile app, placed on a turnstile or a fridge to identify the person and launch the shopping process.
Model server – As we’re talking about real-time video processing, implementing a hardware server at the store will guarantee more stable results. Basically, when a person grabs something from a fridge, the system’s response delay should be unnoticeable, so the hardware components need to respond fast enough.

"All of those components should be interconnected, as there has to be data flow between each unit. As for the cameras, we also want to make sure the store has a stable and fast bandwidth. Since cameras will process live streams of data in real-time, there has to be no delay for the model to function properly. On the other hand, the customer will expect a fast reaction from the vending machine, which depends on how quickly the model receives and processes the data." - Daniil Liadov, Python engineer at MobiDev

Privacy Concerns

Privacy is another question that concerns both retailers and customers. Since computer vision is designed to detect and track objects on video, recording and storing such data may violate privacy laws in some countries.

In the US, it’s generally legal to use surveillance cameras in stores. As long as customers are tracked with random IDs just for the sake of the checkout task, no other technologies like face recognition are required. Even if the camera captures a person’s face, it could be blurred using AI to preserve confidentiality.
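
Tracking with random IDs can be as simple as generating a fresh identifier for every shopping session and discarding it afterwards. The sketch below uses Python’s standard `uuid` module; the registry and function names are hypothetical, and this is an illustration of the idea rather than a compliance recipe.

```python
import uuid

# Hypothetical in-memory registry: tracker slot -> random session ID.
active_sessions = {}


def start_session(track_slot: int) -> str:
    """Assign a random ID that is not derived from the shopper's face or appearance."""
    shopper_id = uuid.uuid4().hex
    active_sessions[track_slot] = shopper_id
    return shopper_id


def end_session(track_slot: int) -> None:
    """Forget the ID once the purchase is complete."""
    active_sessions.pop(track_slot, None)


first = start_session(1)
second = start_session(2)
end_session(1)
print(len(first), first != second, sorted(active_sessions))
```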

Is AI Self-Checkout for Every Retailer?

Like any complex system, autonomous checkout may seem pricey and bulky to implement. Customers are nevertheless willing to use more convenient checkout methods: a Retail Customer Experience report from 2021 found that 60 percent of consumers would choose self-checkout over interaction with a cashier.

That being said, vending machines might be an affordable option for the retail industry, as they bring a lot of benefits at a reasonable cost. Additionally, such systems can be customized to serve the specific needs of a given retailer due to the flexibility of machine learning models. Basically, any type of product can be recognized with proper training. So, convenience stores are not the only ones that can benefit from computer vision applications.