There is plenty in this article that is just wrong.
1. GNNs are no more "sequential" than CNNs and are therefore just as parallelizable in this respect (caveat below). A single GNN layer simply aggregates the features of the connected neighbors, just as a CNN aggregates the values of nearby pixels. This can be parallelized across the nodes/pixels. The next layer depends on the output of the previous layer and is sequential in that sense, but that's true of all forms of neural networks. If other architectures have "won the hardware lottery" relative to GNNs, it's because GNNs depend heavily on sparse-dense matrix multiplication. The real thing that makes GNNs hard to parallelize is that you have to partition the graph intelligently when splitting across machines: there is a computational dependence between nodes, and you don't want connected nodes to end up on different devices. To extend the CNN metaphor, that would be like needing to split a single image across multiple machines and still carry out the convolution operation.
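To make the point concrete, here is a minimal NumPy sketch of a single GNN layer (GCN-style mean aggregation, with a dense adjacency matrix standing in for what would be a sparse one in practice). The per-node loop and the matrix product compute the same thing, which is why the layer parallelizes across nodes exactly like a convolution parallelizes across pixels; the toy graph, sizes, and variable names are all illustrative, not from any particular library.

```python
import numpy as np

# Toy graph: 4 nodes on a path (edges 0-1, 1-2, 2-3).
edges = [(0, 1), (1, 2), (2, 3)]
n, d = 4, 3
rng = np.random.default_rng(0)
X = rng.standard_normal((n, d))      # node feature matrix

# Adjacency with self-loops, row-normalized for mean aggregation.
A = np.eye(n)
for i, j in edges:
    A[i, j] = A[j, i] = 1.0
A_norm = A / A.sum(axis=1, keepdims=True)

W = rng.standard_normal((d, d))      # layer weights

# One GNN layer as a single (sparse-dense) matrix product:
# every node aggregates its neighbors independently.
H = np.maximum(A_norm @ X @ W, 0.0)  # ReLU(A_hat X W)

# The same layer computed node-by-node, exposing the per-node independence
# that makes the operation parallelizable.
H_loop = np.zeros_like(H)
for v in range(n):
    nbrs = np.nonzero(A[v])[0]       # neighbors including self
    H_loop[v] = np.maximum(X[nbrs].mean(axis=0) @ W, 0.0)

assert np.allclose(H, H_loop)
```

In a real system `A_norm` would be stored in a sparse format (CSR/COO), and that sparse-dense product is precisely the kernel that commodity accelerators handle less gracefully than dense convolutions.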
2. It's not true that pre-training doesn't work. It's very common to use unsupervised/self-supervised pre-training to, e.g., learn node embeddings, which are then fine-tuned on a downstream task.
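A toy sketch of that two-stage recipe, using NumPy only: the "pre-training" step here is an adjacency-matrix factorization (a crude stand-in for real objectives like skip-gram random walks or link prediction), and the "fine-tuning" step is a tiny nearest-centroid classifier trained on a handful of labeled nodes. The graph, labels, and classifier are all invented for illustration.

```python
import numpy as np

# Toy graph: two 10-node cliques joined by a single bridge edge,
# so there is clear community structure to discover without labels.
n = 20
A = np.zeros((n, n))
A[:10, :10] = 1.0
A[10:, 10:] = 1.0
A[9, 10] = A[10, 9] = 1.0
np.fill_diagonal(A, 0.0)

# "Pre-training" (no labels): factorize the adjacency matrix to get
# low-dimensional node embeddings -- a stand-in for self-supervised
# objectives such as random-walk skip-gram or link prediction.
U, S, _ = np.linalg.svd(A)
Z = U[:, :2] * S[:2]                 # 2-d unsupervised node embeddings

# "Fine-tuning" on a downstream task: predict community membership
# from the frozen embeddings using only four labeled nodes.
labels = np.array([0] * 10 + [1] * 10)
train = [0, 1, 10, 11]
c0 = Z[[i for i in train if labels[i] == 0]].mean(axis=0)
c1 = Z[[i for i in train if labels[i] == 1]].mean(axis=0)
pred = (np.linalg.norm(Z - c1, axis=1)
        < np.linalg.norm(Z - c0, axis=1)).astype(int)

accuracy = (pred == labels).mean()
```

The structure learned without any labels is enough for the downstream classifier to separate the two communities, which is the whole point of pre-train-then-fine-tune.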
3. It's true that the naive application of deep GNN architectures leads to problems like over-smoothing and the information bottleneck, but there are known solutions to each (e.g., residual/skip connections, normalization, graph rewiring), and it's rarely the case that you genuinely need information from far away in the graph outside of special applications. In those cases, you likely want a different graph representation of the data rather than the perhaps obvious one.
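The over-smoothing point and one of its known fixes can be shown in a few lines. Below, stacking many pure mean-aggregation layers on a ring graph collapses all node features toward the graph mean, while an initial-residual connection (in the spirit of APPNP/GCNII, sketched here from scratch rather than from any library) keeps node representations distinguishable. The graph, depth, and mixing weight are arbitrary choices for the demo.

```python
import numpy as np

rng = np.random.default_rng(2)

# Ring graph of 12 nodes; A_norm averages over {self, two neighbors}.
n = 12
A = np.eye(n)
for i in range(n):
    A[i, (i + 1) % n] = A[i, (i - 1) % n] = 1.0
A_norm = A / A.sum(axis=1, keepdims=True)

X = rng.standard_normal((n, 4))      # initial node features

def spread(H):
    """Mean per-feature std across nodes: how distinguishable nodes remain."""
    return H.std(axis=0).mean()

# 100 pure-aggregation layers: features converge to the graph mean,
# i.e. over-smoothing -- all nodes end up nearly identical.
H_plain = X.copy()
for _ in range(100):
    H_plain = A_norm @ H_plain

# Same depth with an initial-residual connection: each layer mixes the
# aggregated signal back with the original features, so node-specific
# information survives arbitrary depth.
H_res = X.copy()
for _ in range(100):
    H_res = 0.5 * (A_norm @ H_res) + 0.5 * X
```

After 100 layers the plain stack's node-to-node spread is essentially zero, while the residual variant retains a large fraction of the original variation.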
4. It's true that GNNs will perform poorly when improperly applied, whether through the wrong graph representation, a pathological architecture, or a problem with no dependence between the data points. But I don't think that's surprising, and I'm sure there are many problems where simply throwing a CNN at the data doesn't help either. Obviously, the modeling approach needs to fit the inductive biases of the problem.