Sidious Lord - 6 months ago 13

Java Question

I'm not very experienced with Rust and I'm trying to diagnose a performance problem. Below there is a pretty fast Java code (runs in 7 seconds) and what I think should the the equivalent Rust code. However, the Rust code runs very slowly (yes, I compiled it with

`--release`

`i32`

`i64`

`public class Blah {`

static final int N = 100;

static final int K = 50;

public static void main(String[] args) {

//initialize S

int[] S = new int[N];

for (int n = 1; n <= N; n++) S[n-1] = n*n;

// compute maxsum and minsum

int maxsum = 0;

int minsum = 0;

for (int n = 0; n < K; n++) {

minsum += S[n];

maxsum += S[N-n-1];

}

// initialize x and y

int[][] x = new int[K+1][maxsum+1];

int[][] y = new int[K+1][maxsum+1];

y[0][0] = 1;

// bottom-up DP over n

for (int n = 1; n <= N; n++) {

x[0][0] = 1;

for (int k = 1; k <= K; k++) {

int e = S[n-1];

for (int s = 0; s < e; s++) x[k][s] = y[k][s];

for (int s = 0; s <= maxsum-e; s++) {

x[k][s+e] = y[k-1][s] + y[k][s+e];

}

}

int[][] t = x;

x = y;

y = t;

}

// sum of unique K-subset sums

int sum = 0;

for (int s = minsum; s <= maxsum; s++) {

if (y[K][s] == 1) sum += s;

}

System.out.println(sum);

}

}

`extern crate ndarray;`

use ndarray::prelude::*;

use std::mem;

fn main() {

let numbers: Vec<i32> = (1..101).map(|x| x * x).collect();

let deg: usize = 50;

let mut min_sum: usize = 0;

for i in 0..deg {

min_sum += numbers[i] as usize;

}

let mut max_sum: usize = 0;

for i in deg..numbers.len() {

max_sum += numbers[i] as usize;

}

// Make an array

let mut x = OwnedArray::from_elem((deg + 1, max_sum + 1), 0i32);

let mut y = OwnedArray::from_elem((deg + 1, max_sum + 1), 0i32);

y[(0, 0)] = 1;

for n in 1..numbers.len() + 1 {

x[(0, 0)] = 1;

println!("Completed step {} out of {}", n, numbers.len());

for k in 1..deg + 1 {

let e = numbers[n - 1] as usize;

for s in 0..e {

x[(k, s)] = y[(k, s)];

}

for s in 0..max_sum - e + 1 {

x[(k, s + e)] = y[(k - 1, s)] + y[(k, s + e)];

}

}

mem::swap(&mut x, &mut y);

}

let mut ans = 0;

for s in min_sum..max_sum + 1 {

if y[(deg, s)] == 1 {

ans += s;

}

}

println!("{}", ans);

}

Answer

To diagnose a performance issue in general, I:

- Get a baseline time or rate. Preferably create a testcase that only takes a few seconds, as profilers tend to slow down the system a bit. You will also want to iterate frequently.
- Compile in release mode with debugging symbols.
- Run the code in a profiler. I'm on OS X so my main choice is Instruments, but I also use valgrind.
- Find the hottest code path, think about why it's slow, try something, measure.

The last step is the hard part.

In your case, you have a separate implementation that you can use as your baseline. Comparing the two implementations, we can see that **your data structures differ**. In Java, you are building nested arrays, but in Rust you are using the ndarray crate. I know that crate has a good maintainer, but I personally don't know anything about the internals of it, or what use cases it best fits.

So I rewrote it with using the standard-library `Vec`

.

The other thing I know is that direct array access isn't as fast as using an iterator. This is because array access needs to perform a bounds check, while iterators bake the bounds check into themselves. Many times this means using methods on `Iterator`

.

The other change is to perform bulk data transfer when you can. Instead of copying element-by-element, move whole slices around, using methods like `copy_from_slice`

.

With those changes the code looks like this (apologies for poor variable names, I'm sure you can come up with semantic names for them):

```
use std::mem;
const N: usize = 100;
const DEGREE: usize = 50;
fn main() {
let numbers: Vec<_> = (1..N+1).map(|v| v*v).collect();
let min_sum = numbers[..DEGREE].iter().fold(0, |a, &v| a + v as usize);
let max_sum = numbers[DEGREE..].iter().fold(0, |a, &v| a + v as usize);
// different data types for x and y!
let mut x = vec![vec![0; max_sum+1]; DEGREE+1];
let mut y = vec![vec![0; max_sum+1]; DEGREE+1];
y[0][0] = 1;
for &e in &numbers {
let e2 = max_sum - e + 1;
let e3 = e + e2;
x[0][0] = 1;
for k in 0..DEGREE {
let current_x = &mut x[k+1];
let prev_y = &y[k];
let current_y = &y[k+1];
// bulk copy
current_x[0..e].copy_from_slice(¤t_y[0..e]);
// more bulk copy
current_x[e..e3].copy_from_slice(&prev_y[0..e2]);
// avoid array index
for (x, y) in current_x[e..e3].iter_mut().zip(¤t_y[e..e3]) {
*x += *y;
}
}
mem::swap(&mut x, &mut y);
}
let sum = y[DEGREE][min_sum..max_sum+1].iter().enumerate().filter(|&(_, &v)| v == 1).fold(0, |a, (i, _)| a + i + min_sum);
println!("{}", sum);
println!("{}", sum == 115039000);
}
```

- 2.060s - Rust 1.9.0
- 2.225s - Java 1.7.0_45-b18

On OS X 10.11.5 with a 2.3 GHz Intel Core i7.

I'm not experienced enough with Java to know what kinds of optimizations it can do automatically.

The biggest potential next step I see is to leverage SIMD instructions when performing the addition; it's pretty much exactly what SIMD is made for.

As pointed out by Eli Friedman, avoiding array indexing by zipping *isn't* currently the most performant way of doing this.

With the changes below, the time is now **1.267s**.

```
let xx = &mut current_x[e..e3];
xx.copy_from_slice(&prev_y[0..e2]);
let yy = ¤t_y[e..e3];
for i in 0..(e3-e) {
xx[i] += yy[i];
}
```