For now, option 1 seems like the better choice:
Choose 1 first, but not as a thin analytics add-on.
The best path is:
1. Add intelligence and operational depth now.
2. Then build a selective “video-assisted entry” layer later.
Do not jump straight to full video as the main next move.
My actual ranking for your case is:
- Something else: turn the demo into an event decision system
- Option 1: add intelligence, event history, confidence, and operator workflow
- Option 2: move to video later, only for specific bottlenecks
That is the strongest path for three reasons: the market already has vendors doing face-based event check-in; the real differentiation is in workflow and trust; and video adds a large new layer of technical complexity. Wicket and InEvent already market facial event check-in and access control, while NVIDIA’s current multi-camera workflow shows that video systems quickly become detection + tracking + ID management + storage + analytics platforms, not just “more frames.” (Wicket)
The core judgment
Your demo has already solved the easiest part to explain:
- register a face
- detect a face
- match a face
- mark attendance
The next valuable step is not “increase input from image to video.”
It is to answer the real product questions:
- Who attended which events?
- When did they arrive?
- Which gate did they use?
- Was the match strong or weak?
- Did the system auto-approve or did a human review it?
- What happened when the face was low quality or consent was missing?
That matters because NIST evaluates face recognition as a thresholded tradeoff problem in both 1:1 verification and 1:N identification. In plain words, every real deployment must manage false accepts and false rejects, not just maximize demo accuracy. (NIST Pages)
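That thresholded tradeoff can be made concrete with a small sketch. This is not NIST's procedure, just an illustration of the idea: instead of one hard yes/no cutoff, use two thresholds so the uncertain middle goes to a human. The function name and threshold values are assumptions, not calibrated numbers.

```python
# Two-threshold match decision: illustrates why deployments manage
# false accepts and false rejects rather than maximizing demo accuracy.
# Threshold values are placeholders, not calibrated numbers.

def decide(similarity: float,
           accept_threshold: float = 0.60,
           review_threshold: float = 0.40) -> str:
    """Map a 1:N match similarity score to an operational decision.

    Raising accept_threshold lowers false accepts but raises false
    rejects; the review band catches the uncertain middle instead of
    forcing a silent wrong answer.
    """
    if similarity >= accept_threshold:
        return "auto_accept"
    if similarity >= review_threshold:
        return "manual_review"   # routed to an operator queue
    return "reject"

# A weak match is escalated to a human instead of silently failing.
print(decide(0.72))  # auto_accept
print(decide(0.51))  # manual_review
print(decide(0.20))  # reject
```

The point of the middle band is exactly the "was the match strong or weak" and "did a human review it" questions above: the score alone is not a decision.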
Why option 1 is better than option 2 right now
1. Option 1 creates product value faster
Adding event intelligence makes the system useful to organizers immediately.
Commercial event platforms already position facial check-in inside a broader operations workflow: access control, analytics, onboarding, and onsite execution. Wicket’s Dreamforce 2024 case study says Digital Pass was integrated into registration, opt-in only, and made check-in and badge printing 3x faster; InEvent positions facial recognition as part of check-in, access control, and performance analytics. (Wicket)
That tells you something important:
the buyer does not really want “face recognition.”
The buyer wants faster, safer, auditable event operations. That is where option 1 wins.
2. Option 2 is much harder than it looks
Moving from image to video is not a small upgrade. It changes the system class.
With image or guided kiosk capture, the flow is mostly:
detect → align → embed → compare → decide
With video, the flow becomes:
detect per frame → track across frames → select best frames → suppress duplicates → handle side views/occlusion → maintain persistent IDs → possibly re-identify across cameras → aggregate evidence over time
NVIDIA’s own multi-camera workflow lists object detection, feature embeddings, multi-camera tracking, global ID generation, storage, API outputs, and browser visualization. DeepStream’s tracker docs explicitly describe persistent IDs over time, re-identification features, and target re-association. ByteTrack’s README explains why this is hard: low-score detections often contain true objects under occlusion, so trackers need special association logic to recover them. (NVIDIA)
So if you choose option 2 now, you are not adding one feature.
You are starting a second system.
3. Public issues already show where video breaks
This is where GitHub issues are more useful than polished demos.
In one InsightFace issue, a developer trying to count unique people in video says there are really 4 people, but the system emits around 10 IDs because side-angled faces become hard to recognize and new IDs get created. In another issue, a developer trying real-time streaming reports that the setup is extremely slow and logs show CPUExecutionProvider even though onnxruntime-gpu was installed. (GitHub)
That means your likely next pain points in video are not abstract:
- side-angle identity fragmentation
- duplicate identities for the same person
- missed detections
- runtime/provider misconfiguration
- latency that destroys the user experience
Those are real engineering costs.
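The identity-fragmentation failure from that InsightFace issue is easy to reproduce in miniature. The toy sketch below (scores and threshold are made up) shows how a naive per-frame matcher mints a new ID every time a side-angle frame scores below threshold, so one real person becomes several IDs:

```python
# Toy illustration of identity fragmentation in naive video matching:
# when a frame's similarity to the person's existing track drops below
# threshold (side angle, occlusion), a new ID is created.

def assign_ids(similarities, threshold=0.5):
    """similarities: per-frame similarity of each detection to the
    person's current track. Returns the ID assigned to each frame."""
    ids = []
    current = None
    next_id = 0
    for s in similarities:
        if current is not None and s >= threshold:
            ids.append(current)          # track continues
        else:
            current = next_id            # low score -> new identity minted
            next_id += 1
            ids.append(current)
    return ids

# One person walks past the camera; the dips are side-angle frames.
frames = [0.9, 0.8, 0.3, 0.85, 0.2, 0.9]
print(assign_ids(frames))   # [0, 0, 1, 1, 2, 2] -> one person, three IDs
```

This is why real trackers need association logic across frames (as ByteTrack's README describes) rather than per-frame matching: the fix is a tracking problem, not a recognition-accuracy problem.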
Why “something else” should be your actual next move
The best move is not “just analytics.”
It is to build the layer that turns recognition into a trustworthy event system.
I would define that layer as:
Event intelligence + trust + operations
That means building these objects into the product:
- Person
- Enrollment
- Consent
- Event
- Gate / checkpoint
- Sighting
- Attendance decision
- Confidence score
- Manual review status
- Audit log
That gives you a much stronger product than “camera saw face.”
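One way to sketch those objects is as plain dataclasses. The names and fields below are illustrative, not a prescribed schema; the point is that a sighting and the decision made about it are separate records:

```python
# Illustrative domain objects: a raw Sighting is what the camera saw;
# an AttendanceDecision is what the system (or an operator) decided,
# with an audit trail attached.
from dataclasses import dataclass, field
from datetime import datetime
from typing import List, Optional

@dataclass
class Sighting:
    person_id: str
    event_id: str
    gate_id: str
    seen_at: datetime
    similarity: float              # raw match score from the recognizer

@dataclass
class AttendanceDecision:
    sighting: Sighting
    decision: str                  # "auto_accept" | "manual_review" | "reject"
    reviewed_by: Optional[str] = None
    audit_log: List[str] = field(default_factory=list)

s = Sighting("p-001", "ev-42", "north-gate", datetime(2025, 6, 1, 9, 2), 0.81)
d = AttendanceDecision(s, "auto_accept")
d.audit_log.append("auto-accepted at threshold 0.60")
```

Separating the sighting from the decision is what makes "what happened and why" answerable later, which is the whole auditability story.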
What this unlocks
It lets you answer:
- which users attended which events
- first seen time
- last seen time
- late arrivals
- no-shows
- duplicate entry attempts
- zone-level or session-level access
- uncertain matches needing review
- camera-specific failure patterns
- event-level match quality and review rate
This is where your system stops being a demo and starts becoming an operational product. That is also where you can differentiate from basic face-attendance clones, because the category already has vendors doing check-in, but fewer teams build strong confidence handling, review flow, and auditability. This is an inference from how current event vendors frame their products and from NIST’s thresholded evaluation model. (Wicket)
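Most of that list falls out of one sighting table once it exists. A minimal sketch with stdlib SQLite (table and column names are illustrative):

```python
# First-seen / last-seen per attendee from a raw sighting log.
# Late arrivals, no-shows, and duplicate entries are variations on
# the same query shape.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE sightings (
    person_id TEXT, event_id TEXT, gate_id TEXT, seen_at TEXT)""")
conn.executemany(
    "INSERT INTO sightings VALUES (?, ?, ?, ?)",
    [
        ("alice", "ev1", "north", "2025-06-01T09:02"),
        ("alice", "ev1", "north", "2025-06-01T17:40"),
        ("bob",   "ev1", "south", "2025-06-01T11:15"),
    ],
)

for row in conn.execute("""
    SELECT person_id, MIN(seen_at) AS first_seen, MAX(seen_at) AS last_seen
    FROM sightings
    WHERE event_id = 'ev1'
    GROUP BY person_id
    ORDER BY person_id"""):
    print(row)
```

No-shows are the registered set minus this result; duplicate entry attempts are multiple sightings at entry gates within a short window. None of this requires video.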
The hidden reason not to rush into video: trust and compliance
There are two trust problems here.
1. Biometric processing is regulated and sensitive
ICO guidance says that when using biometric recognition systems, you must identify both a lawful basis and a separate condition for processing special-category biometric data. It says that in many cases explicit consent is likely to be the most appropriate condition, and that if you rely on consent you must offer a suitable alternative and allow refusal or withdrawal without detriment. (ICO)
That means the better next feature is not passive video.
It is:
- explicit opt-in
- fallback flow
- deletion / retention rules
- auditability
- transparency
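A consent gate that satisfies the "refusal or withdrawal without detriment" point can be sketched like this (field and function names are assumptions; this is not legal advice, just the control-flow shape):

```python
# Explicit opt-in check before any recognition attempt. Withdrawal
# does not block entry; it routes the attendee to the fallback flow.
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

@dataclass
class ConsentRecord:
    person_id: str
    opted_in_at: Optional[datetime] = None
    withdrawn_at: Optional[datetime] = None

    def active(self) -> bool:
        return self.opted_in_at is not None and self.withdrawn_at is None

def entry_method(consent: ConsentRecord) -> str:
    """Recognition only runs on an active opt-in; everyone else gets a
    suitable alternative (QR, badge, manual desk) with no detriment."""
    return "face_recognition" if consent.active() else "fallback_qr_or_desk"
```

The key property is that `entry_method` never returns "denied": lack of consent changes the path, not the outcome.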
2. Some target markets are weak on necessity and proportionality
The ICO ordered Serco Leisure and associated trusts to stop using facial recognition and fingerprint scanning for employee attendance, saying more than 2,000 employees at 38 facilities had their biometric data processed unlawfully. (ICO)
So if your future thought is “maybe this should become employee attendance,” that is a warning sign.
For your case, controlled event access is stronger than generic workforce attendance.
Another practical issue: your current model path may be demo-only
InsightFace’s repo and PyPI page both say the code is MIT, but the provided pretrained models are for non-commercial research purposes only. (GitHub)
That matters because if you are serious about product direction, your next move should include:
- deciding whether the current stack stays demo-only
- licensing a commercial path
- or replacing the recognition component with a commercially usable stack
There is no point scaling product complexity on top of a model path that may not support commercialization.
So what should you build next, exactly?
Here is the path I would take.
Phase 1. Turn the demo into a real event system
Build:
- event history
- per-event attendance records
- first-seen / last-seen
- confidence score per decision
- uncertain-match queue
- admin review console
- explicit opt-in record
- fallback method like QR, badge, PIN, or manual desk review
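The uncertain-match queue from the list above is small to sketch. One reasonable design (names are illustrative) orders pending sightings most-uncertain first, so operators spend attention where the system is weakest:

```python
# Operator review queue: weakest matches surface first.
# heapq is a min-heap, so lower similarity pops first; the counter
# breaks ties in arrival order.
import heapq

class ReviewQueue:
    def __init__(self):
        self._heap = []      # (similarity, arrival_counter, sighting)
        self._n = 0

    def push(self, similarity: float, sighting) -> None:
        heapq.heappush(self._heap, (similarity, self._n, sighting))
        self._n += 1

    def next_for_review(self):
        return heapq.heappop(self._heap)[2] if self._heap else None

q = ReviewQueue()
q.push(0.52, "sighting-A")
q.push(0.41, "sighting-B")   # weaker match, reviewed first
print(q.next_for_review())   # sighting-B
```

Whether you order by uncertainty, by arrival time, or by gate congestion is a product decision; the queue itself is the cheap part.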
This is your highest-return work. It increases product value, reduces risk, and gives you the data you need to justify later video work. It also aligns with the way real event deployments are framed today. (Wicket)
Phase 2. Add quality and anti-spoofing controls
NIST’s PAD material defines presentation attacks as attempts to interfere with biometric policy using artefacts or human characteristics, often for impersonation or evasion. (NIST Pages)
So before scaling input volume, add:
- enrollment quality checks
- best-photo guidance
- liveness / anti-spoofing
- threshold calibration by environment
- camera health monitoring
This improves trust more than jumping to video.
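Threshold calibration by environment can be as simple as setting each gate's threshold from impostor (known non-match) scores collected there, so the false accept rate stays under a target. A sketch under that assumption, with synthetic scores:

```python
# Per-environment threshold from an impostor score sample.
# With accept-if-score >= threshold, FAR is the fraction of impostor
# scores at or above the threshold. Small samples make this optimistic;
# real calibration needs enough impostor pairs per environment.

def calibrate_threshold(impostor_scores, target_far=0.01):
    """Return the threshold whose false accept rate over this impostor
    sample does not exceed target_far."""
    scores = sorted(impostor_scores)
    cutoff = int(len(scores) * (1 - target_far))
    cutoff = min(cutoff, len(scores) - 1)
    return scores[cutoff]

# A dim gate produces higher impostor scores, so it needs a higher
# threshold than a well-lit kiosk to hit the same FAR target.
dim_gate = [i / 100 for i in range(100)]            # synthetic
print(calibrate_threshold(dim_gate, target_far=0.01))  # 0.99
```

The takeaway is that one global threshold across gates means different effective error rates at each gate, which is exactly the "camera-specific failure patterns" problem from earlier.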
Phase 3. Add selective video only where it clearly pays off
After the system is trustworthy, add video-assisted entry at one controlled gate.
Use video for:
- smoother walk-up experience
- best-frame selection over 1–3 seconds
- fewer retries
- better throughput at peak entry windows
Do not start with venue-wide passive surveillance.
Start with one choke point where users currently pause too much or where image capture misses too many people.
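Best-frame selection over that 1–3 second window is the core of video-assisted entry, and its simplest form is just scoring frames and recognizing only the winner. The frame records and scoring formula below are illustrative; a real scorer would also weigh sharpness and pose:

```python
# Pick the single best frame from a short gate window instead of
# matching every frame. Combining detector confidence with face size
# is a crude but common first-pass quality score.

def best_frame(frames):
    """frames: dicts with 'det_conf' (detector confidence, 0-1) and
    'face_px' (face height in pixels). Returns the frame to recognize."""
    return max(frames, key=lambda f: f["det_conf"] * f["face_px"])

window = [
    {"t": 0.0, "det_conf": 0.62, "face_px": 80},   # still far away
    {"t": 0.8, "det_conf": 0.91, "face_px": 140},  # frontal, close
    {"t": 1.4, "det_conf": 0.55, "face_px": 150},  # motion blur
]
print(best_frame(window)["t"])   # 0.8
```

Note what this buys you: one recognition call per walk-up instead of per frame, which sidesteps most of the duplicate-ID and latency problems from the video section above.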
When option 2 should move earlier
There are cases where video deserves to move up.
Move to video sooner only if most of these are true:
- users will not stop and face a kiosk
- people arrive in bursts and queues matter a lot
- multiple people appear at once
- you need passive capture at a gate
- you have several entrances
- you need to track a person for a short period before deciding
- missed check-ins are caused by motion and pose, not by missing product features
If that is your environment, then the right move is not “full video everywhere.”
It is video-assisted gate recognition.
What I would avoid
I would avoid these as your next main direction:
1. Full passive multi-camera tracking everywhere
Too much complexity too early. NVIDIA’s own materials show this becomes a full analytics and tracking platform. (NVIDIA)
2. Generic employee attendance
Too much regulatory exposure for a first market. (ICO)
3. “Better recognition accuracy” as the strategy
Important, but not enough. The category already exists. Product trust and workflow matter more.
My final recommendation
If I compress everything into one sentence:
Do not choose 2 yet. Choose 1, but expand it into “event intelligence + trust + fallback + review,” then add selective video later.
So the answer is:
- 1 beats 2 right now
- But the best real move is “something else” built around 1
- 2 becomes valuable only after your event system is operationally strong
The simple decision rule
If your current system is best described as:
“It recognizes faces and marks attendance.”
Then your next step should be:
“It makes reliable, auditable event decisions.”
Only after that should it become:
“It does that from short video at selected gates.”
That is the cleanest path from demo to product.