The object detection + ALPR setup is very solid, but thanks to the billions of dollars happily flying out of big tech to keep the big gen AI party going, we realistically could have what feels like DIY China.
I know there are a handful of clever forum users running interesting custom models with similar goals, and some who use audio triggers, but by and large, we haven't been able to fully or reliably do any of the following three extremely valuable things:
- Scene classification (+vehicle desc)
- Sound classification/Speech transcription
- Holistically leverage and make sense of all the data our systems generate (this one being my motivation to create the ALPR database).
Let me paint a picture for you of what this could look like:
I've been testing out all sorts of image and video classification experiments with these massive multi-modal models, and the reasoning capability, accuracy, and level of granularity are unbelievably impressive. So is some of the newer facial recognition. Google's Gemini 2.5 is also extremely cheap compared to OpenAI.

Here are a few examples from my Gemini 2.0 test for video analysis (not posting my cams). I did absolutely no adjustment to the default "describe the video" prompt until the last one with the vehicle, where I asked for slightly more specific info. You can test it for free in a notebook here.
---Experiments on 4 different videos---
- “A security camera positioned at a doorway captures a man walking up to the door. He is wearing a blue t-shirt, black shorts, and athletic shoes. He has a mobile phone in his left hand. As he approaches the door, he reaches out with his right hand to open it, then enters the house. The door is labeled "170 Door" in the corner of the frame. The doorway features a "Hello" doormat, a white bench to the left of the door, and a blue doorbell.”
- "Here's a detailed summary of the video:
- Dog Appears: A medium-sized black dog with a leash attached walks into the camera's view from the left.
- Investigates the Door: The dog sniffs and scratches at the front door.
- Woman Arrives: A woman wearing a pink cap, white top, and black leggings walks into the frame.
- Greets the Dog: She says, "Hi Parker!" and takes hold of the dog's leash.
- Leaves with Dog: The woman and the dog walk away, exiting the frame on the left."
- "Here's a detailed summary of the video: A man wearing a black t-shirt and white shorts, exits a front door. He picks up a package that appears to be a FedEx delivery, and takes it inside, leaving his sandals at the entrance. He then closes the door."
- “Event Details:
Vehicle approaches from [redacted street] and drives into view.
It proceeds along the street.
The car appears to be traveling at a regular speed.
Vehicle Details:
Make: Volvo
Model: XC60 (based on the visual appearance)
Color: Red
Occupant Details:
There appears to be at least one person in the driver's seat.
The driver appears to be female, with red hair, her age would be somewhere between 25 and 40.
There appears to be a dog in the front passenger seat.
Direction of Travel:
The vehicle appears to be heading away from the camera's position.”
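For anyone who wants to try this outside the notebook, here's roughly what the call looks like in Python. This is a minimal sketch assuming the google-generativeai SDK; the model name is an example, so substitute whatever video-capable model is current:

```python
# Sketch: ask Gemini to describe a security clip via the
# google-generativeai SDK. Model name is an example.
import time
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")

# Upload the clip and wait for server-side processing to finish
video = genai.upload_file(path="clip.mp4")
while video.state.name == "PROCESSING":
    time.sleep(2)
    video = genai.get_file(video.name)

model = genai.GenerativeModel("gemini-2.0-flash")
response = model.generate_content([video, "Describe the video."])
print(response.text)
```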
So why exactly is this so monumental?
Well, the first reason is the accuracy and the ability to discriminate between situations more granularly. You get not just a detection, but an actual analysis of what is going on in the scene, such as a person's behavior. For example, if someone is in your driveway in the middle of the night wearing a ski mask and messing with the garage door, that's really nice information to include in an alert, as opposed to "Person" or "crossed zone X".
Similarly, an alert for "person at door" versus "person at door, previously seen on street, holding pistol"... two very different reactions there.
The far bigger reason is the capability that this level of intelligence enables downstream: extremely complex and highly customizable automations, plus the ability to make complex queries into your footage/property history about literally anything – either to feed other services, generate analytics/reports, or just answer ad hoc questions and do research incredibly quickly.
But what does that actually mean for your system?
It lets you do A LOT. Things you wouldn't even think about, and practically anything you might want.

I first started thinking about this in response to ALPR database feature requests for the ability to feed other services like Home Assistant or create custom automations. Sending to HA is simple, but there's a lot that goes into letting users dynamically create any automation they want. While thinking about this, I discovered that you can actually self-host a tool that has exploded in popularity recently called n8n. It lets users easily create agentic-AI workflows with minimal or zero coding. It had never occurred to me how something like this could be useful for a security camera setup, but I will show you a number of ways that n8n and LLMs (either local or cloud) could completely change the game, especially using retrieval-augmented generation (RAG).
Here are a whole bunch of example use cases that came to mind for me:
General Automation
- Person detected? Well, what are they doing and who are they? Define whatever criteria you like to determine what should happen. You can get extremely customizable with this. Almost like having someone there watching and telling you what's going on.
- Package delivered? Add it to the BI memo and send a notification with an image and the service (Amazon, FedEx, etc.) - see the sketch after this list.
- Package picked up and taken away from the doorstep instead of brought inside? Possible porch pirate - save and notify.
- Have certain people who access your property regularly? Provide that (or anything else) as context to inform the AI. Ex: gardener arrived.
- On the lookout for anything in particular, like a description of a person? Tell it that and save any potential matches.
- Get a report of what happened on your property for the week, month, etc. Could be as general or as granular as you like. Ex: 33 new people this week, 12 deliveries (8 Amazon, 4 UPS), etc.
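To make the delivery automation concrete, here's a rough sketch of the classification step: send the alert snapshot to a vision model and get structured JSON back to branch on. The JSON schema, the model name, and the notify step are all placeholders of mine, not anything built into BI or n8n:

```python
# Sketch: classify an alert snapshot into structured JSON, then branch.
# Assumes the google-generativeai SDK; schema and model name are examples.
import json
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")
model = genai.GenerativeModel("gemini-2.0-flash")

PROMPT = """Classify this security camera snapshot. Reply with JSON only:
{"event": "package_delivered|package_taken|person|vehicle|other",
 "carrier": "Amazon|FedEx|UPS|USPS|DHL|unknown",
 "description": "<one sentence>"}"""

def classify_snapshot(jpeg_path: str) -> dict:
    with open(jpeg_path, "rb") as f:
        image = {"mime_type": "image/jpeg", "data": f.read()}
    response = model.generate_content([PROMPT, image])
    # Models often wrap JSON in markdown fences; strip them before parsing
    text = response.text.strip().removeprefix("```json").removesuffix("```")
    return json.loads(text)

result = classify_snapshot("alert.jpg")
if result["event"] == "package_delivered":
    print(f"Notify: {result['carrier']} package. {result['description']}")
```

The same pattern (image in, constrained JSON out) covers most of the items above; only the prompt and the branching change.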
Vehicles & ALPR
- For deliveries, what was delivered and who delivered it? Can use that to tag the vehicle accordingly and add a driver description
- Identify types of vehicles such as delivery (with the specific company), home services, etc
- Check if windows are down or if we can see a face through the windshield. If yes, save the image or clip and associate it to the vehicle.
- Additionally, you could try to use facial recognition at this point. Not likely to work well through windshield, but if window is open and you have a clear shot, you might be able to link a person to the vehicle they drive if they walk by on foot at some other time.
- Vehicle with no plate passes by: check if we are looking at the front or back of the vehicle. If back and no plate, save the image/clip and store any identifiable information about the vehicle (general description, stickers, driver description if possible, damage, any other markings). By saving these things, you build a detailed knowledgebase that can help identify these vehicles if needed - a minimal storage sketch follows this list.
- Identify paper plates
- Vehicle behavior: Did the vehicle park? Did anyone exit the vehicle? Where did they go afterwards? This can be useful for lots of things. Ex: vehicle parks, people exit, vehicle leaves → possibly Uber/rideshare. Vehicle driving really slowly or doing something else strange → maybe tag it to take a look at later.
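For the no-plate knowledgebase, the storage side can be as simple as a small SQLite table next to the ALPR database. A minimal sketch; the table and column names are made up for illustration:

```python
# Sketch: store LLM-generated descriptions of plateless vehicles so they
# can be matched later. Schema is an example, not the ALPR database's.
import sqlite3

conn = sqlite3.connect("vehicle_kb.db")
conn.execute("""CREATE TABLE IF NOT EXISTS sightings (
    id INTEGER PRIMARY KEY,
    ts TEXT,            -- event timestamp
    camera TEXT,        -- which camera saw it
    description TEXT,   -- general description from the model
    markings TEXT,      -- stickers, damage, other identifiers
    clip_path TEXT      -- pointer to the BI clip, not a copy of it
)""")

def save_sighting(ts, camera, description, markings, clip_path):
    conn.execute(
        "INSERT INTO sightings (ts, camera, description, markings, clip_path) "
        "VALUES (?, ?, ?, ?, ?)",
        (ts, camera, description, markings, clip_path))
    conn.commit()
```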
Quality of Life
- If you have a car with mirrors that fold when locked, have the AI check (either after you exit, or at some set time in the evening) whether you remembered to lock the car. If the mirrors are still out, it can send you a text or notification telling you to lock it.
- Evening before trash day and the cans are not out? The AI can remind you.
- Have a grill or outdoor furniture that should be covered before heavy rain? n8n can subscribe to a weather feed, and if rain is coming, check your cameras and advise you to cover the furniture.
- Has the mail come yet today? No need to go check. Add it to an automation for whatever camera looks at your mailbox and have a yes/no answer every day.
Audio analysis
- How many people are speaking? What are they saying? Are they talking normally or yelling? Are they referencing anything you are looking out for?
- What counts as unusual depends on your property, but for me, speech audible from my yard or street late at night is unusual. If speech is detected, that's something I want transcribed and analyzed. If it's just random neighbor conversation, it's probably fine to ignore; otherwise, store the transcript and clip with any keywords for review.
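For the transcription half, one local option is OpenAI's open-source Whisper model (my pick for this sketch; nothing here requires it). Roughly:

```python
# Sketch: transcribe an alert clip's audio locally with Whisper and flag
# it when a watch-listed keyword appears. Keywords are example values.
import whisper

model = whisper.load_model("base")            # small, CPU-friendly model
result = model.transcribe("night_clip.wav")   # needs ffmpeg installed

text = result["text"].lower()
KEYWORDS = {"gate", "garage", "camera", "window"}
if any(word in text for word in KEYWORDS):
    print("Keyword hit, saving transcript:", text)
```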
Reliability
- Check periodically for obstructions of view. If a camera is obstructed, how long has it been that way? What's the nature of the obstruction? Possibly notify you to resolve it.
- Secondary validation. If you are having false positive issues, you could add some secondary validation and decide what to do based on the output of that.
Highly dynamic RAG / Deep Research
By being able to access your video database, you can ask the AI basically anything and have it comb through hundreds of alerts or hours of footage for you: looking for something specific, summarizing events, counting things, and so on. (A minimal retrieval sketch follows the examples below.)
- Find some damage on your property that went unnoticed? Ask the AI to find out when it happened.
- Have a vehicle of interest from your plate database that you want to investigate? Ask the AI to go check all of the times it has passed and report on its findings. This could include the travel habits of a vehicle with extreme detail. Generate a report showing when they pass, which direction, which seem like round trips, etc. Almost scary.
- Have a person of interest or physical description? If you want to really hammer your AI, give it some criteria for what and where to search and let it watch everything back and investigate for you.
- Maybe you have a landscaper/gardener and are curious what they are actually working on while at your house. Ask and get a report of what they appeared to be doing over some period of dates.
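Here's roughly what the retrieval layer behind those queries could look like: embed each stored event description once, then rank descriptions against the question by cosine similarity and hand the top matches (plus their clips) to the LLM. The embedding model name is just an example and the events are fake:

```python
# Sketch of the retrieval step behind "ask the AI about your footage".
# Assumes the google-generativeai SDK; all data below is made up.
import numpy as np
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")

def embed(text: str) -> np.ndarray:
    r = genai.embed_content(model="models/text-embedding-004", content=text)
    return np.array(r["embedding"])

events = [  # (description, clip path) pulled from the metadata DB
    ("FedEx driver left a package at the front door", "clips/0712a.mp4"),
    ("Red SUV passed heading north, no plate visible", "clips/0713b.mp4"),
]
index = [(desc, path, embed(desc)) for desc, path in events]

def search(query: str, k: int = 5):
    q = embed(query)
    cos = lambda v: float(np.dot(q, v) / (np.linalg.norm(q) * np.linalg.norm(v)))
    return sorted(index, key=lambda e: -cos(e[2]))[:k]

for desc, path, _ in search("when was my last delivery?"):
    print(desc, "->", path)
```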
Just for fun / Other
- If you have animals or maybe an indoor camera and dogs or cats, what did they do while you were gone? You can ask and get a report.
- Maybe you have a garden or crops that are getting munched by wildlife. Have the AI watch the alerts and tell you what's going on. You could get a notification like: "Family of 3 deer jumped garden fence and appeared to be eating your tomato plant at 2:37 am last night"
- Maybe you like planes and live near an airport or something and have a PTZ. Tap into flightradar and create a PTZ preset to snap some cool zoomed-in pictures of planes flying over.
How do you go about setting something like this up?
I'm going to play around with it and will post screenshots and guidance as I get it going. I tested several of these just to see if they would work, and all of my tests were successful.

While I don't want to fragment the system into more parts, BI is somewhat limited in terms of storing richer data like this. I think it could be interesting to have a simple event and notification feed dashboard with different criteria to sort by, like priority, tags, event description, faces, vehicles, etc. I'm thinking about storing some of this data in a second database. You don't want to save duplicate copies of footage that is already saved in BI, but other metadata and JPEGs can be stored with minimal space, allowing you to retain a significantly longer history of events. I would guess that most people could store a full year without needing to add storage.
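That year-of-history guess is easy to sanity-check. With an assumed event rate and file sizes (both placeholders; plug in your own numbers), the footprint is tiny:

```python
# Back-of-envelope storage estimate; all inputs are assumptions.
events_per_day = 200        # alerts worth keeping metadata for
jpeg_kb = 150               # one snapshot per event
meta_kb = 2                 # JSON/DB row per event
gb_per_year = events_per_day * (jpeg_kb + meta_kb) * 365 / 1024**2
print(f"~{gb_per_year:.1f} GB/year")  # ~10.6 GB with these numbers
```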
With regard to model selection and deployment, I am looking at a hybrid approach: Qwen2-VL locally, and Google Gemini in the cloud. Gemini is better, but you can't hammer it with requests the way you could something local without running up a bill. Overall, it is actually not too expensive, though, and could be used on its own depending on what your goals are. Based on my calculations, it would cost about $15/month to analyze 800 twelve-second 1080p clips per day.
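If you want to redo that estimate against current pricing, the structure of the math is simple. The token rate and price below are placeholders, not Google's actual numbers; look both up for whichever model you use:

```python
# Skeleton of the monthly cost estimate; both constants are placeholders.
clips_per_day = 800
seconds_per_clip = 12
tokens_per_video_second = 300        # placeholder; see the Gemini docs
usd_per_million_input_tokens = 0.10  # placeholder price
monthly_tokens = clips_per_day * seconds_per_clip * tokens_per_video_second * 30
cost = monthly_tokens / 1e6 * usd_per_million_input_tokens
print(f"~${cost:.2f}/month, input side only")
```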
This is not a substitute for the regular convolutional models we use with CodeProject. These models are far slower and not suitable as the first line of detection. The idea is to use them alongside your existing setup to enhance its capabilities. (They aren't slow in absolute terms, we're talking 5-60 seconds, but that's too slow for your main detection.)
Something that came to mind that would be cool is an Alexa skill that allows you to ask things about your system. What happened while I was gone? When was the last time David was here? Did an exterminator come on Monday? When was the last time I got a DHL package? etc.
The other really nice thing about n8n agents is that they can be easily shared with others. They are also designed to be easy to create and use without coding; any code required is usually short and can be generated by Claude or ChatGPT.
Conclusion
I'm really interested to hear what others think about this, or any ideas people might have. I'm still in the tinkering phase right now, but my mind was blown thinking about all the possibilities, so I wanted to share. While not trivial to set up, it isn't crazy complicated. I think it will just be time-consuming in the same way that setting up BI or Home Assistant is - you are customizing everything for your needs and figuring out what works along the way.
When I have some n8n agents going, I'll share them here and make a video guide showing exactly how to set them up and customize them. The best part, once you get going, is that you can create exactly what you need, or whatever you can think of.