The object detection + ALPR setup is very solid, but thanks to the billions of dollars happily flying out of big tech to keep the big gen AI party going, we realistically could have what feels like DIY China.
I know there are a handful of clever forum users running interesting custom models with similar goals, and some who use audio triggers, but by and large, we haven't been able to fully or reliably do any of the following three extremely valuable things:
- Scene classification (+vehicle desc)
- Sound classification/Speech transcription
- Holistically leverage and make sense of all the data our systems generate (this one being my motivation to create the ALPR database).
Let me paint a picture for you of what this could look like:
I've been testing out all sorts of image and video classification experiments with these massive multi-modal models, and the reasoning capability, accuracy, and level of granularity are unbelievably impressive. So is some of the newer facial recognition. Google's Gemini 2.5 is also extremely cheap compared to OpenAI.

Here are a few examples from my Gemini 2.0 test for video analysis (not posting my cams). I did absolutely no adjustment to the default "describe the video" prompt until the last one with the vehicle, where I asked for slightly more specific info. You can test it for free in a notebook here.
---Experiments on 4 different videos---
- “A security camera positioned at a doorway captures a man walking up to the door. He is wearing a blue t-shirt, black shorts, and athletic shoes. He has a mobile phone in his left hand. As he approaches the door, he reaches out with his right hand to open it, then enters the house. The door is labeled "170 Door" in the corner of the frame. The doorway features a "Hello" doormat, a white bench to the left of the door, and a blue doorbell.”
- "Here's a detailed summary of the video:
- Dog Appears: A medium-sized black dog with a leash attached walks into the camera's view from the left.
- Investigates the Door: The dog sniffs and scratches at the front door.
- Woman Arrives: A woman wearing a pink cap, white top, and black leggings walks into the frame.
- Greets the Dog: She says, "Hi Parker!" and takes hold of the dog's leash.
- Leaves with Dog: The woman and the dog walk away, exiting the frame on the left."
- "Here's a detailed summary of the video: A man wearing a black t-shirt and white shorts, exits a front door. He picks up a package that appears to be a FedEx delivery, and takes it inside, leaving his sandals at the entrance. He then closes the door."
- “Event Details:
Vehicle approaches from [redacted street] and drives into view.
It proceeds along the street.
The car appears to be traveling at a regular speed.
Vehicle Details:
Make: Volvo
Model: XC60 (based on the visual appearance)
Color: Red
Occupant Details:
There appears to be at least one person in the driver's seat.
The driver appears to be female, with red hair, her age would be somewhere between 25 and 40.
There appears to be a dog in the front passenger seat.
Direction of Travel:
The vehicle appears to be heading away from the camera's position.”
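For anyone who wants to try this outside the notebook, here's roughly what the call looks like in Python. This is a minimal sketch assuming the google-generativeai SDK; the model name is an example, so substitute whatever video-capable model is current:

```python
# Sketch: ask Gemini to describe a security clip via the
# google-generativeai SDK. Model name is an example.
import time
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")

# Upload the clip and wait for server-side processing to finish
video = genai.upload_file(path="clip.mp4")
while video.state.name == "PROCESSING":
    time.sleep(2)
    video = genai.get_file(video.name)

model = genai.GenerativeModel("gemini-2.0-flash")
response = model.generate_content([video, "Describe the video."])
print(response.text)
```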
So why exactly is this so monumental?
Well, the first reason is the accuracy and the ability to discriminate between situations more granularly. You get not just a detection, but an actual analysis of what is going on in the scene, such as a person's behavior. For example, if someone is in your driveway in the middle of the night wearing a ski mask and messing with the garage door, that's really nice information to include in an alert, as opposed to "Person" or "crossed zone X".
Similarly, an alert for "person at door" versus "person at door, previously seen on street, holding pistol"... two very different reactions there.
The far bigger reason is the capability that this level of intelligence enables downstream: extremely complex and highly customizable automations, plus the ability to make complex queries into your footage/property history about literally anything – either to feed other services, generate analytics/reports, or just answer ad hoc questions and do research incredibly quickly.
But what does that actually mean for your system?
It lets you do A LOT. Things you wouldn't even think about, and practically anything you might want.

I first started thinking about this in response to ALPR database feature requests for the ability to feed other services like Home Assistant or create custom automations. Sending to HA is simple, but there's a lot that goes into letting users dynamically create any automation they want. While thinking about this, I discovered that you can actually self-host a tool that has exploded in popularity recently called n8n. It lets users easily create agentic-AI workflows with minimal or zero coding. It had never occurred to me how something like this could be useful for a security camera setup, but I will show you a number of ways that n8n and LLMs (either local or cloud) could completely change the game, especially using retrieval-augmented generation (RAG).
Here are a whole bunch of example use cases that came to mind for me:
General Automation
- Person detected? Well, what are they doing and who are they? Define whatever criteria you like to determine what should happen. You can get extremely customizable with this. Almost like having someone there watching and telling you what's going on.
- Package delivered? Add it to the BI memo and send a notification with an image and the service (Amazon, FedEx, etc.) - see the sketch after this list.
- Package picked up and taken away from the doorstep instead of brought inside? Possible porch pirate - save and notify.
- Have certain people who access your property regularly? Provide that (or anything else) as context to inform the AI. Ex: gardener arrived.
- On the lookout for anything in particular, like a description of a person? Tell it that and save any potential matches.
- Get a report of what happened on your property for the week, month, etc. Could be as general or as granular as you like. Ex: 33 new people this week, 12 deliveries (8 Amazon, 4 UPS), etc.
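To make the delivery automation concrete, here's a rough sketch of the classification step: send the alert snapshot to a vision model and get structured JSON back to branch on. The JSON schema, the model name, and the notify step are all placeholders of mine, not anything built into BI or n8n:

```python
# Sketch: classify an alert snapshot into structured JSON, then branch.
# Assumes the google-generativeai SDK; schema and model name are examples.
import json
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")
model = genai.GenerativeModel("gemini-2.0-flash")

PROMPT = """Classify this security camera snapshot. Reply with JSON only:
{"event": "package_delivered|package_taken|person|vehicle|other",
 "carrier": "Amazon|FedEx|UPS|USPS|DHL|unknown",
 "description": "<one sentence>"}"""

def classify_snapshot(jpeg_path: str) -> dict:
    with open(jpeg_path, "rb") as f:
        image = {"mime_type": "image/jpeg", "data": f.read()}
    response = model.generate_content([PROMPT, image])
    # Models often wrap JSON in markdown fences; strip them before parsing
    text = response.text.strip().removeprefix("```json").removesuffix("```")
    return json.loads(text)

result = classify_snapshot("alert.jpg")
if result["event"] == "package_delivered":
    print(f"Notify: {result['carrier']} package. {result['description']}")
```

The same pattern (image in, constrained JSON out) covers most of the items above; only the prompt and the branching change.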
Vehicles & ALPR
- For deliveries, what was delivered and who delivered it? Can use that to tag the vehicle accordingly and add a driver description
- Identify types of vehicles such as delivery (with the specific company), home services, etc
- Check if windows are down or if we can see a face through the windshield. If yes, save the image or clip and associate it to the vehicle.
- Additionally, you could try to use facial recognition at this point. Not likely to work well through windshield, but if window is open and you have a clear shot, you might be able to link a person to the vehicle they drive if they walk by on foot at some other time.
- Vehicle with no plate passes by: check if we are looking at the front or back of the vehicle. If back and no plate, save the image/clip and store any identifiable information about the vehicle (general description, stickers, driver description if possible, damage, any other markings). By saving these things, you build a detailed knowledgebase that can help identify these vehicles if needed - a minimal storage sketch follows this list.
- Identify paper plates
- Vehicle behavior: Did the vehicle park? Did anyone exit the vehicle? Where did they go afterwards? This can be useful for lots of things. Ex: vehicle parks, people exit, vehicle leaves → possibly Uber/rideshare. Vehicle driving really slowly or doing something else strange → maybe tag it to take a look at later.
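For the no-plate knowledgebase, the storage side can be as simple as a small SQLite table next to the ALPR database. A minimal sketch; the table and column names are made up for illustration:

```python
# Sketch: store LLM-generated descriptions of plateless vehicles so they
# can be matched later. Schema is an example, not the ALPR database's.
import sqlite3

conn = sqlite3.connect("vehicle_kb.db")
conn.execute("""CREATE TABLE IF NOT EXISTS sightings (
    id INTEGER PRIMARY KEY,
    ts TEXT,            -- event timestamp
    camera TEXT,        -- which camera saw it
    description TEXT,   -- general description from the model
    markings TEXT,      -- stickers, damage, other identifiers
    clip_path TEXT      -- pointer to the BI clip, not a copy of it
)""")

def save_sighting(ts, camera, description, markings, clip_path):
    conn.execute(
        "INSERT INTO sightings (ts, camera, description, markings, clip_path) "
        "VALUES (?, ?, ?, ?, ?)",
        (ts, camera, description, markings, clip_path))
    conn.commit()
```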
Quality of Life
- If you have a car with mirrors that fold when locked, have the AI check (either after you exit, or at some set time in the evening) whether you remembered to lock the car. If the mirrors are still out, it can send you a text or notification telling you to lock it.
- Evening before trash day and the cans are not out? The AI can remind you.
- Have a grill or outdoor furniture that should be covered before heavy rain? n8n can subscribe to a weather feed, and if rain is coming, check your cameras and advise you to cover the furniture.
- Has the mail come yet today? No need to go check. Add it to an automation for whatever camera looks at your mailbox and have a yes/no answer every day.
Audio analysis
- How many people are speaking? What are they saying? Are they talking normally or yelling? Are they referencing anything you are looking out for?
- What counts as unusual depends on your property, but for me, speech audible from my yard or street late at night is unusual. If speech is detected, that's something I want transcribed and analyzed. If it's just random neighbor conversation, it's probably fine to ignore; otherwise, store the transcript and clip with any keywords for review.
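For the transcription half, one local option is OpenAI's open-source Whisper model (my pick for this sketch; nothing here requires it). Roughly:

```python
# Sketch: transcribe an alert clip's audio locally with Whisper and flag
# it when a watch-listed keyword appears. Keywords are example values.
import whisper

model = whisper.load_model("base")            # small, CPU-friendly model
result = model.transcribe("night_clip.wav")   # needs ffmpeg installed

text = result["text"].lower()
KEYWORDS = {"gate", "garage", "camera", "window"}
if any(word in text for word in KEYWORDS):
    print("Keyword hit, saving transcript:", text)
```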
Reliability
- Check periodically for obstructions of view. If a camera is obstructed, how long has it been that way? What's the nature of the obstruction? Possibly notify you to resolve it.
- Secondary validation. If you are having false positive issues, you could add some secondary validation and decide what to do based on the output of that.
Highly dynamic RAG / Deep Research
By being able to access your video database, you can ask the AI basically anything and have it comb through hundreds of alerts or hours of footage for you: looking for something specific, summarizing events, counting things, and so on. (A minimal retrieval sketch follows the examples below.)
- Find some damage on your property that went unnoticed? Ask the AI to find out when it happened.
- Have a vehicle of interest from your plate database that you want to investigate? Ask the AI to go check all of the times it has passed and report on its findings. This could include the travel habits of a vehicle with extreme detail. Generate a report showing when they pass, which direction, which seem like round trips, etc. Almost scary.
- Have a person of interest or physical description? If you want to really hammer your AI, give it some criteria for what and where to search and let it watch everything back and investigate for you.
- Maybe you have a landscaper/gardener and are curious what they are actually working on while at your house. Ask and get a report of what they appeared to be doing over some period of dates.
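Here's roughly what the retrieval layer behind those queries could look like: embed each stored event description once, then rank descriptions against the question by cosine similarity and hand the top matches (plus their clips) to the LLM. The embedding model name is just an example and the events are fake:

```python
# Sketch of the retrieval step behind "ask the AI about your footage".
# Assumes the google-generativeai SDK; all data below is made up.
import numpy as np
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")

def embed(text: str) -> np.ndarray:
    r = genai.embed_content(model="models/text-embedding-004", content=text)
    return np.array(r["embedding"])

events = [  # (description, clip path) pulled from the metadata DB
    ("FedEx driver left a package at the front door", "clips/0712a.mp4"),
    ("Red SUV passed heading north, no plate visible", "clips/0713b.mp4"),
]
index = [(desc, path, embed(desc)) for desc, path in events]

def search(query: str, k: int = 5):
    q = embed(query)
    cos = lambda v: float(np.dot(q, v) / (np.linalg.norm(q) * np.linalg.norm(v)))
    return sorted(index, key=lambda e: -cos(e[2]))[:k]

for desc, path, _ in search("when was my last delivery?"):
    print(desc, "->", path)
```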
Just for fun / Other
- If you have animals or maybe an indoor camera and dogs or cats, what did they do while you were gone? You can ask and get a report.
- Maybe you have a garden or crops that are getting munched by wildlife. Have the AI watch the alerts and tell you what's going on. You could get a notification like: "Family of 3 deer jumped garden fence and appeared to be eating your tomato plant at 2:37 am last night"
- Maybe you like planes and live near an airport or something and have a PTZ. Tap into flightradar and create a PTZ preset to snap some cool zoomed-in pictures of planes flying over.
How do you go about setting something like this up?
I'm going to play around with it and will post screenshots and guidance as I get it going. I tested several of these just to see if they would work, and all of my tests were successful.

While I don't want to fragment the system into more parts, BI is somewhat limited in terms of storing richer data like this. I think it could be interesting to have a simple event and notification feed dashboard with different criteria to sort by, like priority, tags, event description, faces, vehicles, etc. I'm thinking about storing some of this data in a second database. You don't want to save duplicate copies of footage that is already saved in BI, but other metadata and JPEGs can be stored with minimal space, allowing you to retain a significantly longer history of events. I would guess that most people could store a full year without needing to add storage.
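That year-of-history guess is easy to sanity-check. With an assumed event rate and file sizes (both placeholders; plug in your own numbers), the footprint is tiny:

```python
# Back-of-envelope storage estimate; all inputs are assumptions.
events_per_day = 200        # alerts worth keeping metadata for
jpeg_kb = 150               # one snapshot per event
meta_kb = 2                 # JSON/DB row per event
gb_per_year = events_per_day * (jpeg_kb + meta_kb) * 365 / 1024**2
print(f"~{gb_per_year:.1f} GB/year")  # ~10.6 GB with these numbers
```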
With regard to model selection and deployment, I am looking at a hybrid approach: Qwen2-VL locally, and Google Gemini in the cloud. Gemini is better, but you can't hammer it with requests the way you could something local without running up a bill. Overall, it is actually not too expensive, though, and could be used on its own depending on what your goals are. Based on my calculations, it would cost about $15/month to analyze 800 twelve-second 1080p clips per day.
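If you want to redo that estimate against current pricing, the structure of the math is simple. The token rate and price below are placeholders, not Google's actual numbers; look both up for whichever model you use:

```python
# Skeleton of the monthly cost estimate; both constants are placeholders.
clips_per_day = 800
seconds_per_clip = 12
tokens_per_video_second = 300        # placeholder; see the Gemini docs
usd_per_million_input_tokens = 0.10  # placeholder price
monthly_tokens = clips_per_day * seconds_per_clip * tokens_per_video_second * 30
cost = monthly_tokens / 1e6 * usd_per_million_input_tokens
print(f"~${cost:.2f}/month, input side only")
```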
This is not a substitute for the regular convolutional models we use with CodeProject. These models are far slower and not suitable as the first line of detection. The idea is to use them alongside your existing setup to enhance its capabilities. (They aren't slow in absolute terms, we're talking 5-60 seconds, but that's too slow for your main detection.)
Something that came to mind that would be cool is an Alexa skill that allows you to ask things about your system. What happened while I was gone? When was the last time David was here? Did an exterminator come on Monday? When was the last time I got a DHL package? etc.
The other really nice thing about n8n agents is that they can be easily shared with others. They are also designed to be easy to create and use without coding; any code required is usually short and can be generated by Claude or ChatGPT.
Conclusion
I'm really interested to hear what others think about this, or any ideas people might have. I'm still in the tinkering phase right now, but my mind was blown thinking about all the possibilities, so I wanted to share. While not trivial to set up, it isn't crazy complicated. I think it will just be time-consuming in the same way that setting up BI or Home Assistant is - you are customizing everything for your needs and figuring out what works along the way.
When I have some n8n agents going, I'll share them here and make a video guide showing exactly how to set them up and customize them. The best part, once you get going, is that you can create exactly what you need, or whatever you can think of.