As I could understand, C++ 17 will come with Parallelism. However, what I could not understand is it a specific hardware parallelism (CPU by default)? Or it can be extended to any hardware with multiple computation units?
In other words, will we see something like,for example, "nVidia C++ standard compiler" which is going to compile the parallel parts to be executed on GPUs?
Will it be some more standardized alternative to OpenCL for example?
Note: Absolutely, I am not asking "Will nVidia do that?". I am asking if C++ 17 standards allow that and if it is theoretically possible.
The question provides a link to the paper proposing this change, and, with respect to the parallelism aspects, there haven't been substantial changes to what's proposed. Yes, the compiler can do whatever makes sense for the target hardware to parallelize the execution of various algorithms, provided only that it gets the right answer (with some reservations) and that it doesn't impose unneeded overhead (again, with some reservations).
There are a couple of important points to understand.
First, C++17 parallelism is not a general parallel programming mechanism. It provides parallel versions of many of the STL algorithms, nothing more. So it's not a replacement for more powerful mechanisms like OpenCL, TBB, etc.
Second, there are inherent limitations when you try to parallelize algorithms, and that's why I added those two parenthesized qualifications. For example, the parallel version of
std::accumulate will produce the same result as the non-parallel version only if the function being applied to the input range is commutative and associative. The most obvious problem area here is floating-point values, where math operations are not associative, so the result might differ. Similarly, some algorithms actually impose more overhead when parallelized; you get a net speedup, but there is more total work done, so the speedup for those algorithms will not be linear in the number of processing units.
std::partial_sum is an example: each output value depends on the preceding value, so it's not simple to parallelize the algorithm. There are ways to do it, but you end up applying the combiner function more times than the non-parallel algorithm would. In general, there are relaxations of the complexity requirements for algorithms in order to reflect this reality.