- With Nx.Serving, you can easily batch multiple inference requests.
- This is helpful for several reasons. One of the main ones: there's a lot of overhead in transferring data from CPU to GPU. The CPU & GPU have separate RAM, so a variable stored in CPU memory cannot be accessed directly from a GPU kernel. It has to be copied over first. With batching, you don't have to pay that copy cost again & again. You pay it once per batch instead of once per request!
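To make the win concrete, here is a toy cost model (plain Python, not Nx code) under the simplifying assumption that each CPU-to-GPU copy has a fixed overhead and compute time scales linearly with the number of requests. The function names and the example numbers are made up for illustration.

```python
# Toy cost model: per-request inference pays a fixed host-to-device copy
# overhead every time; batching pays that overhead once per batch.
# copy_ms and compute_ms are assumed example figures, not measurements.

def unbatched_time(n_requests: int, copy_ms: float, compute_ms: float) -> float:
    # Every request copies its own input to the GPU separately.
    return n_requests * (copy_ms + compute_ms)

def batched_time(n_requests: int, copy_ms: float, compute_ms: float) -> float:
    # One copy for the whole batch, then compute for each request.
    return copy_ms + n_requests * compute_ms

# e.g. 8 requests, 5 ms copy overhead, 2 ms compute each:
print(unbatched_time(8, 5, 2))  # 8 * (5 + 2) = 56 ms
print(batched_time(8, 5, 2))    # 5 + 8 * 2  = 21 ms
```

The gap grows with the number of requests: the copy overhead is amortized across the whole batch, which is exactly what batching in Nx.Serving buys you.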