It's per compute unit. The local memory is used by all workgroups executed on the compute unit. One single group can't exceed this size, since it must be executed on a single compute unit.
For example, in your case, with 16 KB of local memory per compute unit, if each workgroup requires 8 KB, at most two workgroups can be resident on a compute unit at the same time.
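
To make that arithmetic concrete, here is a minimal sketch in plain C. The helper name `max_groups_by_local_mem` is made up for illustration; in a real program the first argument would be the value you get back from `clGetDeviceInfo` with `CL_DEVICE_LOCAL_MEM_SIZE`. It only accounts for local memory, so treat the result as an upper bound: registers and other hardware limits can lower the real number.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not part of OpenCL): upper bound on the number of
 * workgroups that can be resident on one compute unit at the same time,
 * considering local memory only.
 *
 * local_mem_size  - local memory per compute unit, in bytes
 *                   (e.g. the value reported for CL_DEVICE_LOCAL_MEM_SIZE)
 * per_group_bytes - local memory required by one workgroup, in bytes
 *
 * Returns 0 when a single workgroup does not fit at all, and SIZE_MAX
 * when the kernel uses no local memory (local memory is not the limit).
 */
size_t max_groups_by_local_mem(size_t local_mem_size, size_t per_group_bytes)
{
    if (per_group_bytes == 0)
        return SIZE_MAX;

    /* Integer division: a partially fitting workgroup cannot be scheduled. */
    return local_mem_size / per_group_bytes;
}
```

With 16 KB per compute unit, `max_groups_by_local_mem(16 * 1024, 8 * 1024)` gives 2, while a workgroup asking for 12 KB leaves room for only one, and one asking for 20 KB cannot run at all.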
CL_DEVICE_LOCAL_MEM_SIZE is the maximum amount of local memory available to a work group. On your NVIDIA card, it corresponds to the on-die shared memory of one multiprocessor, 16 KB in this case, which is shared among the one or more work groups running on that multiprocessor.
