The landscape of computing is currently undergoing a profound transformation with the emergence of spatial computing platforms (VR and AR). As we step into this new era, the intersection of virtual reality, augmented reality, and on-device machine learning presents unprecedented opportunities for developers to create experiences that seamlessly blend digital content with the physical world.
The introduction of visionOS marks a significant milestone in this evolution. Apple’s spatial computing platform combines sophisticated hardware capabilities with powerful development frameworks, enabling developers to build applications that can understand and interact with the physical environment in real time. This convergence of spatial awareness and on-device machine learning opens up new possibilities for object recognition and tracking applications that were previously challenging to implement.
What We’re Building
In this guide, we’ll build an app that showcases the power of on-device machine learning in visionOS. The app recognizes and tracks a diet soda can in real time, overlaying visual indicators and information directly in the user’s field of view.
Our app will leverage several key technologies in the visionOS ecosystem. When a user runs the app, they’re presented with a window containing a rotating 3D model of our target object along with usage instructions. As they look around their environment, the app continuously scans for diet soda cans. Upon detection, it displays dynamic bounding lines around the can and places a floating text label above it, all while maintaining precise tracking as the object or user moves through space.
Before we begin development, let’s make sure we have the necessary tools and understanding in place. This tutorial requires:
- The latest version of Xcode 16 with the visionOS SDK installed
- visionOS 2.0 or later running on an Apple Vision Pro device
- Basic familiarity with SwiftUI and the Swift programming language
The development process will take us through several key stages, from capturing a 3D model of our target object to implementing real-time tracking and visualization. Each stage builds upon the previous one, giving you a thorough understanding of developing features powered by on-device machine learning for visionOS.
Building the Foundation: 3D Object Capture
The first step in creating our object recognition system involves capturing a detailed 3D model of our target object. Apple provides a powerful app for this purpose: Reality Composer, available for iOS through the App Store.
When capturing a 3D model, environmental conditions play a crucial role in the quality of our results. Setting up the capture environment properly ensures we get the best possible data for our machine learning model. A well-lit space with consistent lighting helps the capture system accurately detect the object’s features and dimensions. The diet soda can should be placed on a surface with good contrast, making it easier for the system to distinguish the object’s boundaries.
The capture process begins by launching the Reality Composer app and selecting “Object Capture” from the available options. The app guides us through positioning a bounding box around our target object. This bounding box is essential because it defines the spatial boundaries of our capture volume.

Once we’ve captured all the details of the soda can with the help of the in-app guide and processed the images, a .usdz file containing our 3D model is created. This file format is designed for AR/VR applications and contains not just the visual representation of our object, but also important information that will be used in the training process.
Training the Reference Model
With our 3D model in hand, we move to the next crucial phase: training our recognition model using Create ML. Apple’s Create ML application provides a straightforward interface for training machine learning models, including specialized templates for spatial computing applications.
To begin the training process, we launch Create ML and select the “Object Tracking” template from the Spatial category. This template is specifically designed for training models that can recognize and track objects in three-dimensional space.

After creating a new project, we import our .usdz file into Create ML. The system automatically analyzes the 3D model and extracts key features that will be used for recognition. The interface provides options for configuring how our object should be recognized in space, including viewing angles and tracking preferences.
Once you’ve imported the 3D model and inspected it from various angles, go ahead and click “Train”. Create ML will process our model and begin the training phase. During this phase, the system learns to recognize our object from various angles and under different conditions. The training process can take several hours as the system builds a comprehensive understanding of our object’s characteristics.

The output of this training process is a .referenceobject file, which contains the trained model data optimized for real-time object detection in visionOS. This file encapsulates all the learned features and recognition parameters that will enable our app to identify diet soda cans in the user’s environment.
The successful creation of our reference object marks an important milestone in our development process. We now have a trained model capable of recognizing our target object in real time, setting the stage for implementing the actual detection and visualization functionality in our visionOS application.
Initial Project Setup
Now that we have our trained reference object, let’s set up our visionOS project. Launch Xcode and select “Create a new Xcode project”. In the template selector, choose visionOS under the platforms filter and select “App”. This template provides the basic structure needed for a visionOS application.

In the project configuration dialog, configure your project with these primary settings:
- Product Name: SodaTracker
- Initial Scene: Window
- Immersive Space Renderer: RealityKit
- Immersive Space: Mixed
After project creation, we need to make a few essential modifications. First, delete the file named ToggleImmersiveSpaceButton.swift, as we won’t be using it in our implementation.
Next, we’ll add our previously created assets to the project. In Xcode’s Project Navigator, locate the “RealityKitContent.rkassets” folder and add the 3D object file (“SodaModel.usdz”). This 3D model will be used in our informative view. Then create a new group named “ReferenceObjects” and add the “Diet Soda.referenceobject” file we generated using Create ML.
The final setup step is to configure the necessary permission for object tracking. Open your project’s Info.plist file and add a new key: NSWorldSensingUsageDescription. Set its value to “Used to track diet sodas”. This permission is required for the app to detect and track objects in the user’s environment.
With these setup steps complete, we have a properly configured visionOS project ready for implementing our object tracking functionality.
Entry Point Implementation
Let’s start with SodaTrackerApp.swift, which was automatically created when we set up our visionOS project. We need to modify this file to support our object tracking functionality. Replace the default implementation with the following code:
import SwiftUI
/**
 SodaTrackerApp is the main entry point for the application.
 It configures the app's window and immersive space, and manages
 the initialization of object detection capabilities.
 The app automatically launches into an immersive experience
 where users can see Diet Soda cans being detected and highlighted
 in their environment.
 */
@main
struct SodaTrackerApp: App {
    /// Shared model that manages object detection state
    @StateObject private var appModel = AppModel()
    /// System environment value for launching immersive experiences
    @Environment(\.openImmersiveSpace) var openImmersiveSpace
    var body: some Scene {
        WindowGroup {
            ContentView()
                .environmentObject(appModel)
                .task {
                    // Load and prepare object detection capabilities
                    await appModel.initializeDetector()
                }
                .onAppear {
                    Task {
                        // Launch directly into the immersive experience
                        await openImmersiveSpace(id: appModel.immersiveSpaceID)
                    }
                }
        }
        .windowStyle(.plain)
        .windowResizability(.contentSize)
        // Configure the immersive space for object detection
        ImmersiveSpace(id: appModel.immersiveSpaceID) {
            ImmersiveView()
                .environment(appModel)
        }
        // Use mixed immersion to blend virtual content with reality
        .immersionStyle(selection: .constant(.mixed), in: .mixed)
        // Hide system UI for a more immersive experience
        .persistentSystemOverlays(.hidden)
    }
}
The key aspect of this implementation is the initialization and management of our object detection system. When the app launches, we initialize our AppModel, which handles the ARKit session and object tracking setup. The initialization sequence is crucial:
.task {
    await appModel.initializeDetector()
}
This asynchronous initialization loads our trained reference object so that it is ready for object tracking. It needs to complete before detection starts in the immersive space, because beginDetection returns nil when no reference object has been loaded. Note that .task and .onAppear start independently of each other, so if detection does not start on launch, awaiting initializeDetector() before calling openImmersiveSpace within a single task makes the order explicit.
The immersive space configuration is particularly important for object tracking:
.immersionStyle(selection: .constant(.mixed), in: .mixed)
The mixed immersion style is essential for our object tracking implementation because it allows RealityKit to blend our visual indicators (bounding boxes and labels) with the real-world environment where we’re detecting objects. This creates a seamless experience where digital content accurately aligns with physical objects in the user’s space.
With these modifications to SodaTrackerApp.swift, our app is ready to begin the object detection process, with ARKit, RealityKit, and our trained model working together in the mixed reality environment. In the next section, we’ll examine the core object detection functionality in AppModel.swift, another file that was created during project setup.
Core Detection Model Implementation
AppModel.swift, created during project setup, serves as our core detection system. This file manages the ARKit session, loads our trained model, and coordinates the object tracking process. Let’s examine its implementation:
import SwiftUI
import RealityKit
import ARKit
/**
 AppModel serves as the core model for the soda can detection application.
 It manages the ARKit session, handles object tracking initialization,
 and maintains the state of object detection throughout the app's lifecycle.
 This model is designed to work with visionOS's object tracking capabilities,
 specifically for detecting Diet Soda cans in the user's environment.
 */
@MainActor
@Observable
class AppModel: ObservableObject {
    /// Unique identifier for the immersive space where object detection occurs
    let immersiveSpaceID = "SodaTracking"
    /// ARKit session instance that manages the core tracking functionality
    /// This session coordinates with visionOS to process spatial data
    private var arSession = ARKitSession()
    /// Dedicated provider that handles the real-time tracking of soda cans
    /// This maintains the state of currently tracked objects
    private var sodaTracker: ObjectTrackingProvider?
    /// Collection of reference objects used for detection
    /// These objects contain the trained model data for recognizing soda cans
    private var targetObjects: [ReferenceObject] = []
    /**
     Initializes the object detection system by loading and preparing
     the reference object (Diet Soda can) from the app bundle.
     This method loads a pre-trained model that contains spatial and
     visual information about the Diet Soda can we want to detect.
     */
    func initializeDetector() async {
        guard let objectURL = Bundle.main.url(forResource: "Diet Soda", withExtension: "referenceobject") else {
            print("Error: Failed to locate reference object in bundle - ensure Diet Soda.referenceobject exists")
            return
        }
        do {
            let referenceObject = try await ReferenceObject(from: objectURL)
            self.targetObjects = [referenceObject]
        } catch {
            print("Error: Failed to initialize reference object: \(error)")
        }
    }
    /**
     Starts the active object detection process using ARKit.
     This method initializes the tracking provider with the loaded reference objects
     and begins the real-time detection process in the user's environment.
     Returns: An ObjectTrackingProvider if successfully initialized, nil otherwise
     */
    func beginDetection() async -> ObjectTrackingProvider? {
        guard !targetObjects.isEmpty else { return nil }
        let tracker = ObjectTrackingProvider(referenceObjects: targetObjects)
        do {
            try await arSession.run([tracker])
            self.sodaTracker = tracker
            return tracker
        } catch {
            print("Error: Failed to initialize tracking: \(error)")
            return nil
        }
    }
    /**
     Terminates the object detection process.
     This method stops the ARKit session and cleans up
     tracking resources when object detection is no longer needed.
     */
    func endDetection() {
        arSession.stop()
    }
}
At the core of our implementation is ARKitSession, visionOS’s gateway to spatial computing capabilities. The @MainActor attribute ensures our object detection operations run on the main thread, which is crucial for synchronizing with the rendering pipeline.
private var arSession = ARKitSession()
private var sodaTracker: ObjectTrackingProvider?
private var targetObjects: [ReferenceObject] = []
The ObjectTrackingProvider is a specialized component in visionOS that handles real-time object detection. It works in conjunction with ReferenceObject instances, which contain the spatial and visual information from our trained model. We keep these as private properties to ensure proper lifecycle management.
The initialization process is particularly important:
let referenceObject = try await ReferenceObject(from: objectURL)
self.targetObjects = [referenceObject]
Here, we load our trained model (the .referenceobject file we created in Create ML) into a ReferenceObject instance. This process is asynchronous because the system needs to parse and prepare the model data for real-time detection.
The beginDetection method sets up the actual tracking process:
let tracker = ObjectTrackingProvider(referenceObjects: targetObjects)
try await arSession.run([tracker])
When we create the ObjectTrackingProvider, we pass in our reference objects. The provider uses these to establish the detection parameters: what to look for, what features to match, and how to track the object in 3D space. The ARKitSession.run call activates the tracking system, beginning the real-time analysis of the user’s environment.
Immersive Experience Implementation
ImmersiveView.swift, provided in our initial project setup, manages the real-time object detection visualization in the user’s space. This view processes the continuous stream of detection data and creates visual representations of detected objects. Here’s the implementation:
import SwiftUI
import RealityKit
import ARKit
/**
 ImmersiveView is responsible for creating and managing the augmented reality
 experience where object detection occurs. This view handles the real-time
 visualization of detected soda cans in the user's environment.
 It maintains a collection of visual representations for each detected object
 and updates them in real time as objects are detected, moved, or removed
 from view.
 */
struct ImmersiveView: View {
    /// Access to the app's shared model for object detection functionality
    @Environment(AppModel.self) private var appModel
    /// Root entity that serves as the parent for all AR content
    /// This entity provides a consistent coordinate space for all visualizations
    @State private var sceneRoot = Entity()
    /// Maps unique object identifiers to their visual representations
    /// Enables efficient updating of specific object visualizations
    @State private var activeVisualizations: [UUID: ObjectVisualization] = [:]
    var body: some View {
        RealityView { content in
            // Initialize the AR scene with our root entity
            content.add(sceneRoot)
            Task {
                // Begin object detection and track changes
                let detector = await appModel.beginDetection()
                guard let detector else { return }
                // Process real-time updates for object detection
                for await update in detector.anchorUpdates {
                    let anchor = update.anchor
                    let id = anchor.id
                    switch update.event {
                    case .added:
                        // Object newly detected - create and add visualization
                        let visualization = ObjectVisualization(for: anchor)
                        activeVisualizations[id] = visualization
                        sceneRoot.addChild(visualization.entity)
                    case .updated:
                        // Object moved - update its position and orientation
                        activeVisualizations[id]?.refreshTracking(with: anchor)
                    case .removed:
                        // Object no longer visible - remove its visualization
                        activeVisualizations[id]?.entity.removeFromParent()
                        activeVisualizations.removeValue(forKey: id)
                    }
                }
            }
        }
        .onDisappear {
            // Clean up AR resources when the view is dismissed
            cleanupVisualizations()
        }
    }
    /**
     Removes all active visualizations and stops object detection.
     This ensures proper cleanup of AR resources when the view is no longer active.
     */
    private func cleanupVisualizations() {
        for (_, visualization) in activeVisualizations {
            visualization.entity.removeFromParent()
        }
        activeVisualizations.removeAll()
        appModel.endDetection()
    }
}
The core of our object tracking visualization lies in the detector’s anchorUpdates stream. This ARKit feature provides a continuous flow of object detection events:
for await update in detector.anchorUpdates {
    let anchor = update.anchor
    let id = anchor.id
    switch update.event {
    case .added:
        // Object first detected
    case .updated:
        // Object position changed
    case .removed:
        // Object no longer visible
    }
}
Each ObjectAnchor contains crucial spatial data about the detected soda can, including its position, orientation, and bounding box in 3D space. When a new object is detected (.added event), we create a visualization that RealityKit will render in the correct position relative to the physical object. As the object or user moves, the .updated events keep our digital content aligned with the real world.
Visual Feedback System
Create a new file named ObjectVisualization.swift for handling the visual representation of detected objects. This component is responsible for creating and managing the bounding box and text overlay that appear around detected soda cans:
import RealityKit
import ARKit
import UIKit
import SwiftUI
/**
 ObjectVisualization manages the visual elements that appear when a soda can is detected.
 This class handles both the 3D text label that appears above the object and the
 bounding box that outlines the detected object in space.
 */
@MainActor
class ObjectVisualization {
    /// Root entity that contains all visual elements
    var entity: Entity
    /// Entity specifically for the bounding box visualization
    private var boundingBox: Entity
    /// Width of bounding box lines - 0.003 keeps them visible without being too intrusive
    private let outlineWidth: Float = 0.003
    init(for anchor: ObjectAnchor) {
        entity = Entity()
        boundingBox = Entity()
        // Set up the main entity's transform based on the detected object's position
        entity.transform = Transform(matrix: anchor.originFromAnchorTransform)
        entity.isEnabled = anchor.isTracked
        createFloatingLabel(for: anchor)
        setupBoundingBox(for: anchor)
        refreshBoundingBoxGeometry(with: anchor)
    }
    /**
     Creates a floating text label that hovers above the detected object.
     The text uses the Avenir Next font for readability in AR space and
     is positioned slightly above the object for clear visibility.
     */
    private func createFloatingLabel(for anchor: ObjectAnchor) {
        // 0.06 units gives a readable text size at typical viewing distances
        let labelSize: Float = 0.06
        // Use Avenir Next for its readability and modern appearance in AR
        let font = MeshResource.Font(name: "Avenir Next", size: CGFloat(labelSize))!
        let textMesh = MeshResource.generateText("Diet Soda",
                                                 extrusionDepth: labelSize * 0.15,
                                                 font: font)
        // Create a material that makes text clearly visible against any background
        var textMaterial = UnlitMaterial()
        textMaterial.color = .init(tint: .orange)
        let textEntity = ModelEntity(mesh: textMesh, materials: [textMaterial])
        // Position text above the object with enough clearance to avoid intersection
        textEntity.transform.translation = SIMD3(
            anchor.boundingBox.center.x - textMesh.bounds.max.x / 2,
            anchor.boundingBox.extent.y + labelSize * 1.5,
            0
        )
        entity.addChild(textEntity)
    }
    /**
     Creates a bounding box visualization that outlines the detected object.
     Uses a semi-transparent magenta color to provide a clear
     but non-distracting visual boundary around the detected soda can.
     */
    private func setupBoundingBox(for anchor: ObjectAnchor) {
        let boxMesh = MeshResource.generateBox(size: [1.0, 1.0, 1.0])
        // Create a single material for all edges with magenta color
        let boundsMaterial = UnlitMaterial(color: .magenta.withAlphaComponent(0.4))
        // Create all 12 edges with a uniform appearance
        for _ in 0..<12 {
            let edge = ModelEntity(mesh: boxMesh, materials: [boundsMaterial])
            boundingBox.addChild(edge)
        }
        entity.addChild(boundingBox)
    }
    /**
     Updates the visualization when the tracked object moves.
     This ensures the bounding box and text maintain accurate positioning
     relative to the physical object being tracked.
     */
    func refreshTracking(with anchor: ObjectAnchor) {
        entity.isEnabled = anchor.isTracked
        guard anchor.isTracked else { return }
        entity.transform = Transform(matrix: anchor.originFromAnchorTransform)
        refreshBoundingBoxGeometry(with: anchor)
    }
    /**
     Updates the bounding box geometry to match the detected object's dimensions.
     Scales and positions each edge so the outline matches the physical object's
     boundaries while keeping a uniform line width.
     */
    private func refreshBoundingBoxGeometry(with anchor: ObjectAnchor) {
        let extent = anchor.boundingBox.extent
        boundingBox.transform.translation = anchor.boundingBox.center
        for (index, edge) in boundingBox.children.enumerated() {
            guard let edge = edge as? ModelEntity else { continue }
            switch index {
            case 0...3: // Horizontal edges along width
                edge.scale = SIMD3(extent.x, outlineWidth, outlineWidth)
                edge.position = [
                    0,
                    extent.y / 2 * (index % 2 == 0 ? -1 : 1),
                    extent.z / 2 * (index < 2 ? -1 : 1)
                ]
            case 4...7: // Vertical edges along height
                edge.scale = SIMD3(outlineWidth, extent.y, outlineWidth)
                edge.position = [
                    extent.x / 2 * (index % 2 == 0 ? -1 : 1),
                    0,
                    extent.z / 2 * (index < 6 ? -1 : 1)
                ]
            case 8...11: // Depth edges
                edge.scale = SIMD3(outlineWidth, outlineWidth, extent.z)
                edge.position = [
                    extent.x / 2 * (index % 2 == 0 ? -1 : 1),
                    extent.y / 2 * (index < 10 ? -1 : 1),
                    0
                ]
            default:
                break
            }
        }
    }
}
The bounding box creation is a key aspect of our visualization. Rather than using a single box mesh, we construct 12 individual edges that form a wireframe outline. This approach provides better visual clarity and allows for more precise control over the appearance. The edges are positioned using SIMD3 vectors for efficient spatial calculations:
edge.position = [
    extent.x / 2 * (index % 2 == 0 ? -1 : 1),
    extent.y / 2 * (index < 10 ? -1 : 1),
    0
]
This positioning aligns each edge with the detected object’s dimensions. The calculation uses the object’s extent (width, height, depth) and creates a symmetrical arrangement around its center point: each edge is centered on the axis it runs along and offset by half the extent, in either direction, on the other two axes.
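To see the symmetry concretely, the same index scheme can be written as a standalone function. This is a sketch under assumptions: EdgePlacement and edgePlacements(extent:lineWidth:) are hypothetical names introduced for illustration, and the function uses only the standard library's SIMD3 type rather than RealityKit entities.

```swift
/// Hypothetical value type: where one wireframe edge sits and how it is scaled,
/// relative to the center of the bounding box.
struct EdgePlacement {
    let position: SIMD3<Float>
    let scale: SIMD3<Float>
}

/// Computes placements for the 12 edges of a box, following the same index
/// scheme as refreshBoundingBoxGeometry: indices 0...3 run along x,
/// 4...7 along y, and 8...11 along z.
func edgePlacements(extent: SIMD3<Float>, lineWidth: Float) -> [EdgePlacement] {
    let half = extent / 2
    return (0..<12).map { index -> EdgePlacement in
        let axis = index / 4                                  // axis the edge runs along
        let firstSign: Float = index % 2 == 0 ? -1 : 1        // sign on the first remaining axis
        let secondSign: Float = index % 4 < 2 ? -1 : 1        // sign on the second remaining axis
        let others = [0, 1, 2].filter { $0 != axis }

        // Stretch a unit cube along the edge's axis, keep it thin on the others.
        var scale = SIMD3<Float>(repeating: lineWidth)
        scale[axis] = extent[axis]

        // Centered on its own axis, pushed out to the box faces on the other two.
        var position = SIMD3<Float>(repeating: 0)
        position[others[0]] = half[others[0]] * firstSign
        position[others[1]] = half[others[1]] * secondSign

        return EdgePlacement(position: position, scale: scale)
    }
}
```

For an extent of (2, 4, 6), edge 0 runs along the width and is centered at (0, -2, -3), while edge 11 runs along the depth and is centered at (1, 2, 0).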
This visualization system works in conjunction with our ImmersiveView to create real-time visual feedback. As the ImmersiveView receives position updates from ARKit, it calls refreshTracking on our visualization, which updates the transform matrices to maintain precise alignment between the digital overlays and the physical object.
Informative View

ContentView.swift, provided in our project template, handles the informational interface for our app. Here’s the implementation:
import SwiftUI
import RealityKit
import RealityKitContent
/**
 ContentView provides the main window interface for the application.
 Displays a rotating 3D model of the target object (Diet Soda can)
 along with clear instructions for users on how to use the detection feature.
 */
struct ContentView: View {
    // State to control the repeating rotation animation
    @State private var rotation: Double = 0
    var body: some View {
        VStack(spacing: 30) {
            // 3D model display with rotation animation
            Model3D(named: "SodaModel", bundle: realityKitContentBundle)
                .padding(.vertical, 20)
                .frame(width: 200, height: 200)
                .rotation3DEffect(
                    .degrees(rotation),
                    axis: (x: 0, y: 1, z: 0)
                )
                .onAppear {
                    // Rotate back and forth between 0 and 180 degrees, repeating forever
                    withAnimation(.linear(duration: 5.0).repeatForever(autoreverses: true)) {
                        rotation = 180
                    }
                }
            // Instructions for users
            VStack(spacing: 15) {
                Text("Diet Soda Detection")
                    .font(.title)
                    .fontWeight(.bold)
                Text("Hold your diet soda can in front of you to see it automatically detected and highlighted in your space.")
                    .font(.body)
                    .multilineTextAlignment(.center)
                    .foregroundColor(.secondary)
                    .padding(.horizontal)
            }
        }
        .padding()
        .frame(maxWidth: 400)
    }
}
This implementation displays our 3D-scanned soda model (SodaModel.usdz) with a rotating animation, providing users with a clear reference of what the system is looking for. The rotation helps users understand how to present the object for optimal detection.
With these components in place, our application now provides a complete object detection experience. The system uses our trained model to recognize diet soda cans, creates precise visual indicators in real time, and provides clear user guidance through the informational interface.
Conclusion

In this tutorial, we’ve built a complete object detection system for visionOS that showcases the integration of several powerful technologies. Starting from 3D object capture, through ML model training in Create ML, to real-time detection using ARKit and RealityKit, we’ve created an app that seamlessly detects and tracks objects in the user’s space.
This implementation represents just the beginning of what’s possible with on-device machine learning in spatial computing. As hardware continues to evolve with more powerful Neural Engines and dedicated ML accelerators, and as frameworks like Core ML mature, we’ll see increasingly sophisticated applications that can understand and interact with our physical world in real time. The combination of spatial computing and on-device ML opens up possibilities for applications ranging from advanced AR experiences to intelligent environmental understanding, all while maintaining user privacy and low latency.