The fastest solution I can think of is to not use arrays in the first place: use individual variables and an access function that lets you treat them as if they were an array. IIRC (at least for the AMD compiler, but I'm pretty sure this is true for NVIDIA as well), arrays are generally stored in memory, while scalars may be stored in registers. (My memory is a little fuzzy on this, though, so I might be wrong!)
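To make the idea concrete, here is a rough sketch in plain C of what I mean by "individual variables plus an access function". The name lut4_get, the table size of 4 and the uint32_t element type are all made up for illustration; in OpenCL C you would use uint instead of uint32_t and drop the include. Whether the compiler actually keeps the values in registers is still up to the implementation.

```c
#include <stdint.h>

/* Hypothetical accessor: treats four separate scalars as a 4-entry table.
   Because the entries are plain scalars (not an array), the compiler is
   free to keep them in registers instead of spilling to memory. */
static inline uint32_t lut4_get(uint32_t v0, uint32_t v1,
                                uint32_t v2, uint32_t v3,
                                unsigned i)
{
    /* Mask the index so it always falls in 0..3 (assumes a table of 4). */
    switch (i & 3u) {
    case 0:  return v0;
    case 1:  return v1;
    case 2:  return v2;
    default: return v3;
    }
}
```

You would then write something like y = lut4_get(a, b, c, d, idx); wherever the kernel previously did y = LUT[idx];. The switch usually compiles down to a handful of selects, which is the trade-off you accept for avoiding the array.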
As rtollert stated, it is up to the implementation to decide whether LUT is placed in registers or in global memory. Normally arrays in a kernel are a no-no, but since this one is small it's hard to say where it will be placed. Assuming that LUT is placed in registers, I would say the reason it takes a long time compared to a simple arithmetic operation isn't that it's accessed randomly, but that each work item makes an additional 8 (Edit: apparently a lot more) global reads of X to calculate the LUT index.
Depending on what's omitted, could you do something like Y[i*gsi+gid] = global_table[someIndex] + X[i*gsi+gid];?
I can't really give you an answer, but what I can give you is a way to a solution: you have to find the angle that you relate to or that piques your interest. A good paper is one that people get drawn into because it reaches them in some way. As for me, when I think of WWII, I think of the Holocaust and the effect it had on the survivors, their families, and those who stood by and did nothing until it was too late.