DeepStack Case Study: Performance from CPU to GPU version

jaydeel

BIT Beta Team
Joined
Nov 9, 2016
Messages
454
Reaction score
384
Location
SF Bay Area
I'm posting this thread for anyone considering making the transition from the DeepStack CPU version to the GPU version.

Hopefully this will give you an impression of what you can expect in terms of enhanced performance.

Please note some of my observations are system-specific. My system specs:
  • I7-4770 processor
  • 16 GB RAM
  • 500 GB SSD
  • 8 TB Purple hard drive
  • PNY NVIDIA Quadro P400 V2
  • Headless
  • 9 x 2MP cameras, all continuously dual streaming
  • 5/9 cameras using DeepStack (all Dahua and triggered via ONVIF using IVS tripwires)
  • EDIT: DeepStack default object detection only - no face detection, no custom models
  • This server is used only for Blue Iris AND a PHP server (used only for home automation)

The next 6 screenshots show DeepStack processing-time data for 5 cameras over a period of 3 weeks. Two images are shown for each week:
  • the 1st (left) shows all the data (full-scale);
  • the 2nd (right) shows a subset of the data on an expanded scale (0 to 600 msec).

1) Period 1: the week before the P400 was installed.
1630362810899.png 1630363051949.png

2) Period 2: the week during which the P400 and DS GPU version were installed (this event took place before midnight on day 2)
1630363084298.png 1630363100359.png

3) Period 3: the week after the P400 card was installed.
1630363201854.png 1630363223829.png

The next 2 images compare the DeepStack processing time statistics for Periods 1 & 3 (7 days on the CPU version vs 7 days on the GPU version). Please note that the statistical analyses exclude the long-duration event 'outliers' and use only the data points between 0 and 1000 msec. This approach was taken to provide a 'cleaner', more meaningful comparison when the system is running 'normally' (not stressed).

1630363547845.png 1630363611069.png
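
For anyone wanting to run the same kind of comparison on their own logs, here is a minimal Python sketch of the filtering step described above: keep only events between 0 and 1000 msec, then compute the mean and standard deviation. The function name `summarize` and the sample values are made up for illustration, and the post does not say whether the sample or population standard deviation was used; the sketch assumes the sample form.

```python
from statistics import mean, stdev

def summarize(times_ms, lo=0, hi=1000):
    """Return (count, mean, stdev) for processing times within [lo, hi] msec.

    Events outside the window are treated as long-duration outliers and dropped.
    Uses the sample standard deviation (an assumption), so at least two
    in-range values are required.
    """
    kept = [t for t in times_ms if lo <= t <= hi]
    return len(kept), mean(kept), stdev(kept)

# Made-up values for illustration only, not data from the charts.
sample = [450, 520, 480, 3900, 410, 12000, 505]
count, avg, sd = summarize(sample)
print(f"{count} events kept, mean {avg:.0f} msec, stdev {sd:.0f} msec")
```

Running it once per period (CPU week, GPU week) gives the two sets of numbers to compare.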

This last screenshot is a mark-up of the expanded scale data for Period 2.
1630363940240.png

Observations:
  1. Overall: as expected, the GPU version yields much faster and less noisy results... and the frequency and severity of very long events is greatly reduced.

  2. The worst processing times for the GPU version are rarely slower than the best processing times for the CPU version.

  3. Statistics: The GPU version is ~3.4X faster (139 msec mean vs 469 msec) and ~2.8X less dispersed (43 msec stdev vs 122 msec).

    Note also that both Periods 1 & 3 contain a similar number of events (1030 vs 1075) and a similar ratio of confirmed:total events (0.41 vs 0.38). The latter is consistent with the motion detection schemes being unchanged over the 3-week duration of this experiment.

  4. DeepStack 'Confirmed' events have the same statistics as 'Cancelled' events, regardless of the version. I'm not sure I expected this for the CPU version, but the data is convincing.

  5. Using the settings 'Use main stream if available' and 'Save DeepStack analysis details' increased the GPU version processing time by roughly 20%. (Please note that I conducted this experiment for a little over a day only, and I have not yet performed an independent evaluation of the two settings, so one of them may be dominating the apparent difference. If so, my bet is on the former.)
Please note that observations 2, 3 & 5 are applicable to my system only. If you have a less powerful CPU, the ratios should be higher. Conversely, if you have a more powerful CPU, you may need a better NVIDIA card to observe a similar improvement.
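
The ratios quoted in observation 3 follow directly from the Period 1 and Period 3 statistics. A short Python check of that arithmetic (the variable names are just for illustration):

```python
# Statistics quoted in observation 3 (0-1000 msec window only).
cpu_mean, gpu_mean = 469, 139     # msec, Period 1 (CPU) vs Period 3 (GPU)
cpu_stdev, gpu_stdev = 122, 43    # msec

speedup = cpu_mean / gpu_mean           # how many times faster the GPU version is
spread_ratio = cpu_stdev / gpu_stdev    # how many times tighter the GPU timings are

print(f"mean: {speedup:.1f}X faster, stdev: {spread_ratio:.1f}X tighter")
# prints "mean: 3.4X faster, stdev: 2.8X tighter"
```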
 

jaydeel

I'm reposting 3 images; this time they all have the same y-axis range (0-1000 msec).

The stats in the previous post apply to screenshots 1 & 3.

1) Period 1: DeepStack CPU version performance (1,030 events)
1630388809349.png

2) Period 2: Transition (1,162 events)
1630389236473.png

3) Period 3: DeepStack GPU version performance (1,075 events)
1630389270455.png
 
Joined
Dec 28, 2019
Messages
6,204
Reaction score
12,940
Location
New Jersey
Step 1 - Install an NVIDIA CUDA-capable card, preferably one with a large number of CUDA cores.
Step 2 - Follow the installation instructions for the GPU version as posted on the DS forum/page. You can skip the last step in those instructions regarding Visual Studio.
Step 3 - You're good to go.
 

kc8tmv

Getting the hang of it
Joined
May 27, 2017
Messages
124
Reaction score
60
Location
Cincinnati, Ohio
No uninstalling of the CPU version?
 

jaydeel

You would not happen to have a "step by step / how to" for moving from the CPU version to the GPU version would you?
Same as what @sebastiantombs said... I kept notes on the links, so I'll add those:
  1. Using DeepStack with Windows 10 (CPU and GPU) | DeepStack v1.2.1 documentation
  2. CUDA Toolkit 10.1 original Archive
  3. Installation Guide :: NVIDIA Deep Learning cuDNN Documentation
Note: #3 has a prerequisite -- you must register for the NVIDIA Developer Program. I did not understand all the jargon, but I managed nonetheless.

As for uninstalling the CPU version first, I cannot recall explicitly doing this. The GPU package installer may have taken care of this.

A few more details...
I started the conversion at about 9:30 pm and was done by 11:00 pm. This included dragging the PC out of the closet, attaching a USB keyboard/mouse, installing the P400 card, attaching a monitor, then (arghh) futzing around getting the display to work (because I had the PC set up to use an extended dual display, and the attached display was the #2 monitor, with the HDMI plug as the #1 monitor). I think I spent no more than 40 minutes installing the GPU version and getting it set up in Blue Iris. Quite surprising to me, everything worked the first time! I hit the sack at 11:30.
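
If you want a quick sanity check outside Blue Iris after a switch like this, one option is to time a few detection requests against the DeepStack server directly. The Python sketch below is only that, a sketch: the helper names `time_detection` and `deepstack_send` are made up, the URL assumes DeepStack is listening on port 80 of the same machine (use whatever port you started it with), and it relies on the third-party `requests` package. Round-trip times measured this way include HTTP overhead, so they will not match the figures Blue Iris reports exactly.

```python
import time

def time_detection(send, image_bytes, runs=5):
    """Call send(image_bytes) `runs` times; return the round-trip times in msec."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        send(image_bytes)
        timings.append((time.perf_counter() - start) * 1000.0)
    return timings

def deepstack_send(image_bytes, url="http://localhost:80/v1/vision/detection"):
    """POST one image to DeepStack's object detection endpoint.

    The host and port in `url` are assumptions; adjust them to your install.
    """
    import requests  # third-party: pip install requests
    response = requests.post(url, files={"image": image_bytes}, timeout=30)
    response.raise_for_status()
    return response.json()

# Example use (test.jpg is a placeholder for any camera snapshot):
#   with open("test.jpg", "rb") as f:
#       print(time_detection(deepstack_send, f.read()))
```

The first request after a restart is often slower than the rest, so it is worth looking at the later runs rather than the first one.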
 

jaydeel

One last chart...

This chart shows CPU usage before and after enabling the settings 'Use main stream if available' and 'Save DeepStack analysis details'.
Both settings were disabled before the leftmost vertical purple line; ditto after the rightmost purple line.
These data were collected every 10 minutes. The chart has 675 data points.

1630446154695.png

Observations:
  1. Enabling 'Use main stream if available' on 5/5 DeepStack-enabled cameras increased the CPU usage by just a few percent.

  2. Also enabling 'Save DeepStack analysis details' on 5/5 DeepStack-enabled cameras increased the CPU usage from ~20% to ~35%.

    Bottom line... continuously saving DeepStack analysis details has a measurable impact, but perhaps not a huge one if you've got CPU cycles to spare.
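
For reference, 675 samples at one per 10 minutes works out to a little under 5 days of data. If you log CPU usage the same way, a rough Python sketch like the one below can average each phase between settings changes. The helper name `mean_by_phase`, the sample values, and the change points are all made up for illustration; they are not the data behind the chart.

```python
from statistics import mean

def mean_by_phase(cpu_percent, change_points):
    """Average CPU usage within each phase.

    cpu_percent   - samples in time order (one per 10 minutes here)
    change_points - sample indexes at which a setting was toggled
    """
    edges = [0, *sorted(change_points), len(cpu_percent)]
    return [mean(cpu_percent[a:b]) for a, b in zip(edges, edges[1:]) if b > a]

# Time span covered by the chart: 675 samples, 10 minutes apart.
span_days = 675 * 10 / (60 * 24)

# Made-up samples: both settings off, main stream on, both on, both off again.
samples = [20, 21, 19, 23, 22, 35, 36, 34, 20, 21]
phase_means = mean_by_phase(samples, change_points=[3, 5, 8])
print(f"{span_days:.1f} days of data; phase means: {phase_means}")
```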
 

samplenhold

Known around here
Joined
Aug 8, 2018
Messages
3,347
Reaction score
8,590
Location
Spring, Texas
Quick question: If your MB/CPU combination has built-in graphics, and you also install a separate graphics card, can you use both? For example, on my system I have many cams using NVIDIA NVDEC for hardware acceleration (HA), but I could not get all cams to use that. There seems to be a limit on the number of cams, so the rest are using Intel for HA. Is there a way to get the onboard GPU to also be used by BI for HA? Can you specify which GPU to use for DeepStack? What if I added a second graphics card? Could that be used also?

1630452240752.png
 

wittaj

Known around here
Joined
Apr 28, 2019
Messages
6,057
Reaction score
8,514
Location
USA
Yep, you can go into each individual camera and select the GPU number and the type. You can use multiple graphic cards if you have them.

You can install the GPU version of Deepstack instead of Windows.
 

kklee

Pulling my weight
Joined
May 9, 2020
Messages
146
Reaction score
151
Location
Vancouver, BC
That is quite an improvement from a very low-end GPU.
I'm running the exact same video card and have similar results. It's reasonably priced and has much lower power consumption compared to mainstream gaming cards.
 

wittaj

So, I wanted to get a GPU to offload OpenALPR to it. The documentation says it can, but I couldn't get it to work and OpenALPR took a look at it and couldn't figure out what to do other than get a bigger computer and GPU card LOL.

So before I take it back, I thought I would try it with Blue Iris and DeepStack.

I am seeing improvements similar to what was posted here and what @sebastiantombs and others had indicated elsewhere. The GPU version is looking to be about 8 times faster than the CPU version.
 

CCTVCam

Known around here
Joined
Sep 25, 2017
Messages
1,274
Reaction score
1,191
Just a quick question, what's the difference between a Quadro and a gaming card with the same number of CUDA cores? I see a gaming card for 1/2 the price of the above Quadro with 384 CUDA cores.

Will it be faster having more CUDA cores and more memory or is there another factor?
 

whoami ™

Pulling my weight
Joined
Aug 4, 2019
Messages
148
Reaction score
118
Location
South Florida
Nvidia hand picks the best silicon for Quadro. So the chips for gaming cards are from the same batch. Quadro are enterprise grade cards and come with different memory timings and clock speeds. Quadro cards allow for things like VM GPU passthrough while consumer gaming cards have driver limitations placed on them and require a hack.

Applications can be core-heavy or memory-intensive, so it depends on the application. Core clock and memory timings are also a factor, so cards with the same resources aren't necessarily equal.

DeepStack is not a heavy workload. You would need to run multiple instances of DS computing in parallel to even attempt to place a heavy load on a Quadro P400. If you were using something like an old GTX 1070, it'd be such overkill you wouldn't even realize it was working.

If you also want the card to decode video on more than 4-6 cams, memory will be the limiting factor. In that case I would be looking at the Quadro T600 with 4 GB GDDR6 @ 160 GB/s.
 