Python Question

I am merging two DataFrames together. They are originally `.csv` files. The merge is:

`merge = pd.merge(df1, df2, on=['POINTID'], how='outer')`

and in 32-bit Anaconda I get:

`ValueError: negative dimensions are not allowed`

but on 64-bit Anaconda I get a memory error.

I have 12 gigabytes of RAM and only 30% of it is being used so it should not be a memory issue. I tried on another computer and get the same issue.

Answer

On a 32-bit machine, the default NumPy integer dtype is `int32`.
On a 64-bit machine, the default NumPy integer dtype is `int64`.

The largest integers representable by an `int32` and an `int64` are:

```
In [88]: np.iinfo('int32').max
Out[88]: 2147483647
In [87]: np.iinfo('int64').max
Out[87]: 9223372036854775807
```

So the integer index created by `pd.merge` will support a maximum of `2147483647 = 2**31-1` rows on a 32-bit machine, and `9223372036854775807 = 2**63-1` rows on a 64-bit machine.

In theory, two 290000-row DataFrames merged with an `outer` join may have as many as `290000**2 = 84100000000` rows. Since

```
In [89]: 290000**2 > np.iinfo('int32').max
Out[89]: True
```

the 32-bit machine may not be able to generate an integer index big enough to index the merged result.

And although the 64-bit machine can in theory generate an integer index big enough to accommodate the result, you may not have enough memory to build an 84-billion-row DataFrame.
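To put a rough number on that, here is a minimal sketch of the memory a worst-case result would need. It assumes every value is stored in 8 bytes (as for `int64` or `float64`); the column count `N_COLUMNS` is a hypothetical figure chosen for illustration, not something taken from your data.

```python
# Rough lower bound on memory for the worst-case merge result.
# Assumption: every value occupies 8 bytes (int64 or float64).
ROWS = 290_000 ** 2      # worst-case row count from the estimate above
BYTES_PER_VALUE = 8
N_COLUMNS = 3            # hypothetical: POINTID plus one column per frame

bytes_per_column = ROWS * BYTES_PER_VALUE
total_bytes = bytes_per_column * N_COLUMNS

print(f"rows:        {ROWS:,}")
print(f"per column:  {bytes_per_column / 1e9:.1f} GB")
print(f"all columns: {total_bytes / 1e9:.1f} GB")
```

Even a single 8-byte column at that row count is far beyond 12 GB of RAM, before counting the index or any temporary arrays.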

Now, of course, the merged DataFrame may have fewer than 84 billion rows (the exact number depends on how many duplicate values appear in `df1['POINTID']` and `df2['POINTID']`), but the above back-of-the-envelope calculation shows that the behavior you are seeing is consistent with having a lot of duplicates.
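One way to check this before merging is to count how many rows the join would produce from the key multiplicities alone. The following is a sketch using only the standard library; `outer_join_rows` is a hypothetical helper name, and it assumes a single key column with no missing values.

```python
from collections import Counter

def outer_join_rows(left_keys, right_keys):
    """Count the rows an outer join on a single key column would produce."""
    left = Counter(left_keys)
    right = Counter(right_keys)
    total = 0
    for key in left.keys() | right.keys():
        # A key present on both sides yields every left/right pairing;
        # a key present on only one side keeps its rows as they are.
        l, r = left.get(key, 0), right.get(key, 0)
        total += l * r if l and r else l + r
    return total

# Hypothetical toy data, not the actual POINTID values.
print(outer_join_rows([1, 2, 3], [2, 3, 4]))    # all keys unique
print(outer_join_rows([0] * 1000, [0] * 1000))  # one key repeated 1000 times
```

Iterating over a pandas Series yields its values, so passing `df1['POINTID']` and `df2['POINTID']` should work the same way. If the result is far larger than either input, the key column contains heavy duplication.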

PS. You can get negative values when adding or multiplying positive integers in NumPy arrays if there is arithmetic overflow:

```
In [92]: np.int32(290000)*np.int32(290000)
Out[92]: -1799345920
```

My guess is that this is the reason for the exception:

```
ValueError: negative dimensions are not allowed
```
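The negative number above is what two's-complement wraparound predicts. As a sketch in plain Python (the helper `wrap_int32` is a hypothetical name used only for illustration, and this does not claim to reproduce what pandas does internally), the true product can be reduced to the value a signed 32-bit integer would hold:

```python
def wrap_int32(n):
    """Reduce a Python int to the value a signed 32-bit integer would hold."""
    return (n + 2**31) % 2**32 - 2**31

true_product = 290_000 * 290_000
print(true_product)              # 84100000000
print(wrap_int32(true_product))  # -1799345920
```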