A brute-force approach is to bin the particles (order n) and then sort by bin id with an n log n sort. This can even be done in parallel with parallel primitives (and it runs fast on a GPU too). At that point you still have to search for neighbors among the particles within each bin, taking neighboring bins into account.
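A minimal NumPy sketch of that bin-then-sort idea, assuming hypothetical names: 3D positions `pos`, a cubic domain of size `box`, and an interaction radius `h`. It is a serial illustration of the data layout, not the parallel/GPU version.

```python
import numpy as np

def build_cell_list(pos, box, h):
    """Bin particles into cells of size >= h, then sort particle indices by cell id."""
    ncell = max(1, int(box / h))                 # cells per dimension
    cell_size = box / ncell
    ijk = np.clip((pos // cell_size).astype(np.int64), 0, ncell - 1)   # O(n) binning
    cell_id = (ijk[:, 0] * ncell + ijk[:, 1]) * ncell + ijk[:, 2]
    order = np.argsort(cell_id)                  # O(n log n) sort by bin id
    # start/end offsets of each cell in the sorted particle index array
    all_ids = np.arange(ncell**3)
    starts = np.searchsorted(cell_id[order], all_ids)
    ends = np.searchsorted(cell_id[order], all_ids, side="right")
    return order, starts, ends, ncell

def neighbors_of(i, pos, order, starts, ends, ncell, box, h):
    """Candidate neighbors of particle i from its own cell and the 26 adjacent cells."""
    cell_size = box / ncell
    ci = np.clip((pos[i] // cell_size).astype(np.int64), 0, ncell - 1)
    cand = []
    for dx in (-1, 0, 1):
        for dy in (-1, 0, 1):
            for dz in (-1, 0, 1):
                c = ci + np.array([dx, dy, dz])
                if np.any(c < 0) or np.any(c >= ncell):
                    continue                     # no periodic wrap in this sketch
                cid = (c[0] * ncell + c[1]) * ncell + c[2]
                cand.append(order[starts[cid]:ends[cid]])
    cand = np.concatenate(cand)
    d = np.linalg.norm(pos[cand] - pos[i], axis=1)
    return cand[(d < h) & (cand != i)]
```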
This is not optimal if you need to resort every timestep but particles only move at most one bin per timestep (the case in most particle-based codes), and/or if there is a large range of densities (in which case you may be sorting a large number of particles within a single bin).
When particles only move a small amount and you have a presorted system, then an optimal incremental resort is the way to go, and I think this would be order N per spatial dimension.
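One way the incremental resort can be realized is an insertion sort on the previous ordering: when each particle's bin id changes by at most one bin per step, each index only has to shift a few places, so the total work stays close to O(N). A small sketch, with hypothetical arguments `keys` (new bin ids) and `order` (particle indices from the previous step):

```python
def insertion_resort(keys, order):
    """Resort `order` by `keys`, assuming the previous ordering is nearly correct.
    Cost is O(N + number of inversions), which is ~O(N) for small displacements."""
    for i in range(1, len(order)):
        j = i
        # shift the particle left until its key fits; only a few swaps are
        # needed per particle when the system is almost sorted already
        while j > 0 and keys[order[j - 1]] > keys[order[j]]:
            order[j - 1], order[j] = order[j], order[j - 1]
            j -= 1
    return order
```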
If there is a wide range of densities in your particle distribution, you might consider an octree (octal tree) algorithm, which is also N log N to build but would give you a nearest-neighbor list based on the tree itself. There are efficient ways of fixing or updating trees as a particle system evolves. Trees can be constructed in parallel, but I am not convinced those constructions are very efficient yet; I think they use a Morton code/hash to bin, which also gives an octree structure, and then define tree nodes based on the bit changes in the bin ids. I have only seen implementations that go a few levels deep in the tree, so there are ways to improve the parallel algorithms for this setting.
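A rough sketch of my reading of the "bit changes" idea, not a full tree builder: with particles sorted by Morton code, the most significant bit at which two adjacent codes first differ tells you the deepest octree node the two particles share, which is where a node boundary sits. Here `bits_per_dim` is an assumed code width.

```python
import numpy as np

def shared_node_depth(sorted_codes, bits_per_dim=10):
    """For each adjacent pair of sorted Morton codes, return the depth of the
    deepest octree node containing both particles (0 = root). Shallow values
    mark coarse node boundaries; deep values mean a shared small leaf cell."""
    x = sorted_codes[:-1] ^ sorted_codes[1:]                 # differing bits
    total_bits = 3 * bits_per_dim
    highest = np.floor(np.log2(np.where(x == 0, 1, x))).astype(int)
    shared_bits = np.where(x == 0, total_bits, total_bits - 1 - highest)
    return shared_bits // 3                                   # 3 bits per octree level
```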
If you use the Morton hash to make your bins, then the resulting sort can also be used to count the number of particles in bins, and in bins two, four, or eight times as big, since dropping one, two, or three of the low bits of a Morton code doubles the cell along one, two, or all three dimensions (merging eight sibling cells into their parent). This is kind of like using the octree. Binning and sorting is pretty fast, so you can do it several times depending on what range of densities you expect in your particle distribution.
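A short sketch of reusing one Morton sort for coarser counts, assuming `sorted_codes` is the already-sorted array of Morton codes:

```python
import numpy as np

def coarse_counts(sorted_codes, drop_bits):
    """Count particles per coarse bin obtained by dropping the lowest `drop_bits`
    bits of each Morton code: 1, 2, or 3 bits give bins 2x, 4x, or 8x as big,
    and multiples of 3 climb whole octree levels."""
    coarse = sorted_codes >> drop_bits
    ids, counts = np.unique(coarse, return_counts=True)
    return ids, counts
```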
Other ideas include hashing with other space-filling curves and using other types of trees. The k-d tree is popular, but the parallel implementations of it I have seen have not been particularly clean or pretty.
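For reference, a serial k-d tree neighbor query is already a few lines with SciPy's `scipy.spatial.cKDTree`; the caveats above are about parallel construction, not this simple usage.

```python
import numpy as np
from scipy.spatial import cKDTree

pos = np.random.rand(10000, 3)                      # hypothetical particle positions
tree = cKDTree(pos)                                 # O(n log n) build
neighbors = tree.query_ball_point(pos[0], r=0.05)   # indices within radius 0.05 of particle 0
```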