Sidious Lord - 6 months ago
Java Question

Diagnosing a performance issue

I'm not very experienced with Rust and I'm trying to diagnose a performance problem. Below is a pretty fast Java program (it runs in 7 seconds) and what I think should be the equivalent Rust code. However, the Rust code runs very slowly (yes, I compiled it with --release as well), and it also appears to overflow. Changing i32 to i64 just pushes the overflow later, but it still happens. I suspect there is a bug in what I wrote, but after staring at the problem for a long time, I decided to ask for help.

public class Blah {

    static final int N = 100;
    static final int K = 50;

    public static void main(String[] args) {
        // initialize S
        int[] S = new int[N];
        for (int n = 1; n <= N; n++) S[n-1] = n*n;

        // compute maxsum and minsum
        int maxsum = 0;
        int minsum = 0;
        for (int n = 0; n < K; n++) {
            minsum += S[n];
            maxsum += S[N-n-1];
        }

        // initialize x and y
        int[][] x = new int[K+1][maxsum+1];
        int[][] y = new int[K+1][maxsum+1];
        y[0][0] = 1;

        // bottom-up DP over n
        for (int n = 1; n <= N; n++) {
            x[0][0] = 1;
            for (int k = 1; k <= K; k++) {
                int e = S[n-1];
                for (int s = 0; s < e; s++) x[k][s] = y[k][s];
                for (int s = 0; s <= maxsum-e; s++) {
                    x[k][s+e] = y[k-1][s] + y[k][s+e];
                }
            }
            int[][] t = x;
            x = y;
            y = t;
        }

        // sum of unique K-subset sums
        int sum = 0;
        for (int s = minsum; s <= maxsum; s++) {
            if (y[K][s] == 1) sum += s;
        }
        System.out.println(sum);
    }

}


extern crate ndarray;

use ndarray::prelude::*;
use std::mem;

fn main() {
    let numbers: Vec<i32> = (1..101).map(|x| x * x).collect();

    let deg: usize = 50;

    // compute min_sum and max_sum
    let mut min_sum: usize = 0;
    for i in 0..deg {
        min_sum += numbers[i] as usize;
    }

    let mut max_sum: usize = 0;
    for i in deg..numbers.len() {
        max_sum += numbers[i] as usize;
    }

    // Make an array
    let mut x = OwnedArray::from_elem((deg + 1, max_sum + 1), 0i32);
    let mut y = OwnedArray::from_elem((deg + 1, max_sum + 1), 0i32);

    y[(0, 0)] = 1;

    // bottom-up DP over n
    for n in 1..numbers.len() + 1 {
        x[(0, 0)] = 1;
        println!("Completed step {} out of {}", n, numbers.len());
        for k in 1..deg + 1 {
            let e = numbers[n - 1] as usize;
            for s in 0..e {
                x[(k, s)] = y[(k, s)];
            }
            for s in 0..max_sum - e + 1 {
                x[(k, s + e)] = y[(k - 1, s)] + y[(k, s + e)];
            }
        }
        mem::swap(&mut x, &mut y);
    }

    // sum of unique deg-subset sums
    let mut ans = 0;

    for s in min_sum..max_sum + 1 {
        if y[(deg, s)] == 1 {
            ans += s;
        }
    }

    println!("{}", ans);
}

Answer

To diagnose a performance issue in general, I:

  1. Get a baseline time or rate. Preferably create a testcase that only takes a few seconds, as profilers tend to slow down the system a bit. You will also want to iterate frequently.
  2. Compile in release mode with debugging symbols (see the Cargo.toml sketch after this list).
  3. Run the code in a profiler. I'm on OS X so my main choice is Instruments, but I also use valgrind.
  4. Find the hottest code path, think about why it's slow, try something, measure.

The last step is the hard part.
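For step 2, one way to keep symbols in an optimized build is to enable debug info for the release profile in Cargo.toml. A minimal sketch, assuming a standard Cargo project layout:

[profile.release]
debug = true

Then build with cargo build --release and point the profiler at the resulting binary.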


In your case, you have a separate implementation that you can use as your baseline. Comparing the two implementations, we can see that your data structures differ. In Java, you are building nested arrays, but in Rust you are using the ndarray crate. I know that crate has a good maintainer, but I personally don't know anything about the internals of it, or what use cases it best fits.

So I rewrote it using the standard-library Vec.

The other thing I know is that direct array access isn't as fast as using an iterator. This is because array access needs to perform a bounds check, while iterators bake the bounds check into themselves. Many times this means using methods on Iterator.
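For illustration, here is a minimal sketch of the two styles (the function names are made up):

// Indexed access: each x[i] and y[i] is bounds-checked,
// unless the optimizer can prove the indices are in range.
fn add_indexed(x: &mut [i32], y: &[i32]) {
    for i in 0..x.len() {
        x[i] += y[i];
    }
}

// Iterator style: zip bakes the bounds into the iterator itself,
// so the loop body needs no per-element index checks.
fn add_zipped(x: &mut [i32], y: &[i32]) {
    for (a, b) in x.iter_mut().zip(y) {
        *a += *b;
    }
}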

The other change is to perform bulk data transfer when you can. Instead of copying element-by-element, move whole slices around, using methods like copy_from_slice.
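As a small sketch of the idea (the helper name is hypothetical):

// Copy the first `n` elements of `src` into `dst` in one bulk move.
// copy_from_slice does a single length check and then a memcpy-style
// copy, instead of one bounds-checked assignment per element.
fn copy_prefix(dst: &mut [i32], src: &[i32], n: usize) {
    dst[..n].copy_from_slice(&src[..n]);
}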

With those changes the code looks like this (apologies for poor variable names, I'm sure you can come up with semantic names for them):

use std::mem;

const N: usize = 100;
const DEGREE: usize = 50;

fn main() {
    let numbers: Vec<_> = (1..N+1).map(|v| v*v).collect();

    let min_sum = numbers[..DEGREE].iter().fold(0, |a, &v| a + v as usize);
    let max_sum = numbers[DEGREE..].iter().fold(0, |a, &v| a + v as usize);

    // different data types for x and y!
    let mut x = vec![vec![0; max_sum+1]; DEGREE+1];
    let mut y = vec![vec![0; max_sum+1]; DEGREE+1];
    y[0][0] = 1;

    for &e in &numbers {
        let e2 = max_sum - e + 1;
        let e3 = e + e2;

        x[0][0] = 1;

        for k in 0..DEGREE {
            let current_x = &mut x[k+1];
            let prev_y = &y[k];
            let current_y = &y[k+1];

            // bulk copy
            current_x[0..e].copy_from_slice(&current_y[0..e]);

            // more bulk copy
            current_x[e..e3].copy_from_slice(&prev_y[0..e2]);

            // avoid array index
            for (x, y) in current_x[e..e3].iter_mut().zip(&current_y[e..e3]) {
                *x += *y;
            }
        }

        mem::swap(&mut x, &mut y);
    }

    let sum = y[DEGREE][min_sum..max_sum+1]
        .iter()
        .enumerate()
        .filter(|&(_, &v)| v == 1)
        .fold(0, |a, (i, _)| a + i + min_sum);

    println!("{}", sum);
    println!("{}", sum == 115039000);
}
Timings on my machine:

  • 2.060s - Rust 1.9.0
  • 2.225s - Java 1.7.0_45-b18

On OS X 10.11.5 with a 2.3 GHz Intel Core i7.

I'm not experienced enough with Java to know what kinds of optimizations it can do automatically.

The biggest potential next step I see is to leverage SIMD instructions when performing the addition; it's pretty much exactly what SIMD is made for.
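Explicit SIMD wasn't available on stable Rust at the time of this answer, but on a newer toolchain a rough x86_64 sketch of the inner addition might look like the following (the helper name is made up, and a real version should be benchmarked against the plain loop, which LLVM can often auto-vectorize on its own):

#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::*;

// x[i] += y[i], four i32 lanes at a time with SSE2.
#[cfg(target_arch = "x86_64")]
unsafe fn add_in_place_sse2(x: &mut [i32], y: &[i32]) {
    assert_eq!(x.len(), y.len());
    let chunks = x.len() / 4;
    for i in 0..chunks {
        let px = x.as_mut_ptr().add(i * 4) as *mut __m128i;
        let py = y.as_ptr().add(i * 4) as *const __m128i;
        // Unaligned load, packed 32-bit add, unaligned store.
        let sum = _mm_add_epi32(_mm_loadu_si128(px as *const __m128i), _mm_loadu_si128(py));
        _mm_storeu_si128(px, sum);
    }
    // Scalar tail for the remaining `len % 4` elements.
    for i in chunks * 4..x.len() {
        x[i] += y[i];
    }
}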


As pointed out by Eli Friedman, avoiding array indexing by zipping isn't currently the most performant way of doing this.

With the changes below, the time is now 1.267s.

// inside the k-loop, replacing the second bulk copy and the zip loop:
let xx = &mut current_x[e..e3];
xx.copy_from_slice(&prev_y[0..e2]);

// indexed addition turns out faster than the zipped version in this case
let yy = &current_y[e..e3];
for i in 0..(e3-e) {
    xx[i] += yy[i];
}