3 hours ago, _Flame_ said:
I have tried to have more triangles per quad but it made it worse.
Artificially adding triangles that are not needed to model the required shape is definitely a bad idea. What the statement about low polycounts means is that the performance gain relative to other techniques shrinks as polycount decreases, not the absolute performance. So instancing with only two triangles per quad should not be slower than instancing with one thousand triangles per quad. It's just that, compared to rendering the same geometry without instancing (so "duplicating" vertex buffers, or at least using more draw calls), the relative gains will be bigger if your instances have more triangles.
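To put rough numbers on that, here is a crude sketch that only counts bytes of vertex data, with and without instancing. The byte sizes and function names are made up for illustration, and it ignores draw-call overhead and everything else that affects real frame times, so treat it as a way to see the trend and not as a benchmark.

```cpp
#include <cstdint>

// Hypothetical sizes, chosen only for illustration.
constexpr std::uint64_t kBytesPerVertex = 12;    // e.g. one vec3 position
constexpr std::uint64_t kBytesPerInstance = 16;  // e.g. xyz position + size

// Without instancing: every instance carries its own copy of the mesh vertices.
constexpr std::uint64_t duplicatedBytes(std::uint64_t instances,
                                        std::uint64_t verticesPerMesh) {
    return instances * verticesPerMesh * kBytesPerVertex;
}

// With instancing: one shared mesh plus a small record per instance.
constexpr std::uint64_t instancedBytes(std::uint64_t instances,
                                       std::uint64_t verticesPerMesh) {
    return verticesPerMesh * kBytesPerVertex + instances * kBytesPerInstance;
}

// How many times more vertex data the non-instanced version needs.
constexpr double savingsRatio(std::uint64_t instances,
                              std::uint64_t verticesPerMesh) {
    return static_cast<double>(duplicatedBytes(instances, verticesPerMesh)) /
           static_cast<double>(instancedBytes(instances, verticesPerMesh));
}
```

With these assumed sizes, `savingsRatio(100000, 4)` comes out at roughly 3, while `savingsRatio(100000, 1000)` is in the hundreds. The instanced version with 4 vertices is not slower in absolute terms; it just has less to gain over the duplicated version.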
3 hours ago, _Flame_ said:
I would be relieved if problem is just gpu performance.
In the comments under the tutorial from your original post, some people shared their framerates for the demo. Maybe that can give an indication. I'm assuming your GPU is a mobile one (you can switch between integrated and discrete, and it has an 'M' in its name?), so I guess it's kind of expected that it performs a bit worse than a desktop GPU of the same generation...
Maybe the way you update the billboard transformations each frame is not optimal? Look here for more info.
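As one possible illustration of what "more optimal" could look like (this is my assumption, not necessarily what your code or the tutorial does): pack the per-billboard data into a single contiguous CPU array each frame and send it in one upload, instead of touching the buffer once per billboard. The `Particle` struct and `packInstanceData` function below are hypothetical names for the sketch; the OpenGL calls appear only in comments because they need a live context.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical particle record; field names are illustrative only.
struct Particle {
    float x, y, z;  // world-space position
    float size;     // billboard size
    bool alive;     // dead particles are skipped
};

// Packs live particles into `out` as tightly packed (x, y, z, size) tuples
// and returns how many were written. `out` is meant to be reused between
// frames so its allocation is not repeated every frame.
std::size_t packInstanceData(const std::vector<Particle>& particles,
                             std::vector<float>& out) {
    out.clear();                        // keeps the existing capacity
    out.reserve(particles.size() * 4);  // worst case: every particle alive
    std::size_t live = 0;
    for (const Particle& p : particles) {
        if (!p.alive) continue;
        out.push_back(p.x);
        out.push_back(p.y);
        out.push_back(p.z);
        out.push_back(p.size);
        ++live;
    }
    return live;
}

// With an OpenGL context and the instance buffer bound, the packed array
// could then be sent once per frame, for example:
//   glBufferData(GL_ARRAY_BUFFER, maxBytes, nullptr, GL_STREAM_DRAW);  // orphan old storage
//   glBufferSubData(GL_ARRAY_BUFFER, 0, live * 4 * sizeof(float), out.data());
```

The returned count is what you would pass as the instance count to the instanced draw call, so dead particles cost nothing on the GPU side.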