juliana juliana - 2 months ago 13
Java Question


I ran this code in PowerShell by following the steps and commands for PowerShell in this tutorial.
I only changed the name from WordCount to Matrix.
All the steps work fine, but I get this error after running the Azure PowerShell script:

Exception in thread "main" org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does not exist

The code

import java.io.IOException;
import java.util.*;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class OneStepMatrixMultiplication {

    public static class Map extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
            Configuration conf = context.getConfiguration();
            int m = Integer.parseInt(conf.get("m"));
            int p = Integer.parseInt(conf.get("p"));
            // Each input line has the form <matrix>,<row>,<column>,<value>.
            String line = value.toString();
            String[] indicesAndValue = line.split(",");
            Text outputKey = new Text();
            Text outputValue = new Text();
            if (indicesAndValue[0].equals("A")) {
                // Entry A(i,j) is needed by every output cell (i,k), k = 0..p-1.
                for (int k = 0; k < p; k++) {
                    outputKey.set(indicesAndValue[1] + "," + k);
                    outputValue.set("A," + indicesAndValue[2] + "," + indicesAndValue[3]);
                    context.write(outputKey, outputValue);
                }
            } else {
                // Entry B(j,k) is needed by every output cell (i,k), i = 0..m-1.
                for (int i = 0; i < m; i++) {
                    outputKey.set(i + "," + indicesAndValue[2]);
                    outputValue.set("B," + indicesAndValue[1] + "," + indicesAndValue[3]);
                    context.write(outputKey, outputValue);
                }
            }
        }
    }

    public static class Reduce extends Reducer<Text, Text, Text, Text> {
        @Override
        public void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
            String[] value;
            // Values for one output cell (i,k), indexed by the shared dimension j.
            HashMap<Integer, Float> hashA = new HashMap<Integer, Float>();
            HashMap<Integer, Float> hashB = new HashMap<Integer, Float>();
            for (Text val : values) {
                value = val.toString().split(",");
                if (value[0].equals("A")) {
                    hashA.put(Integer.parseInt(value[1]), Float.parseFloat(value[2]));
                } else {
                    hashB.put(Integer.parseInt(value[1]), Float.parseFloat(value[2]));
                }
            }
            int n = Integer.parseInt(context.getConfiguration().get("n"));
            float result = 0.0f;
            float a_ij;
            float b_jk;
            // Sum a_ij * b_jk over j; missing entries count as zero.
            for (int j = 0; j < n; j++) {
                a_ij = hashA.containsKey(j) ? hashA.get(j) : 0.0f;
                b_jk = hashB.containsKey(j) ? hashB.get(j) : 0.0f;
                result += a_ij * b_jk;
            }
            if (result != 0.0f) {
                context.write(null, new Text(key.toString() + "," + Float.toString(result)));
            }
        }
    }

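    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // A is an m-by-n matrix; B is an n-by-p matrix.
        conf.set("m", "2");
        conf.set("n", "5");
        conf.set("p", "3");

        Job job = new Job(conf, "MatrixMatrixMultiplicationOneStep");
        // Job wiring: without these calls Hadoop falls back to its identity mapper and reducer.
        job.setJarByClass(OneStepMatrixMultiplication.class);
        job.setMapperClass(Map.class);
        job.setReducerClass(Reduce.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);

        // args[0] is the input path and args[1] is the output path.
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}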
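
To make it easier to see what the job above is meant to compute, here is a minimal local sketch of the same map and reduce logic in plain Java, with no Hadoop involved. The class name LocalMatrixMultiply and its helper methods are hypothetical, and the input line format (A,i,j,value and B,j,k,value) is inferred from how the mapper splits each line, so treat it as an assumption rather than a specification.

```java
import java.util.*;

public class LocalMatrixMultiply {

    // Mirrors the mapper: every input line fans out to each output cell it contributes to.
    static Map<String, List<String>> mapPhase(List<String> lines, int m, int p) {
        Map<String, List<String>> grouped = new TreeMap<>();
        for (String line : lines) {
            String[] f = line.split(",");
            if (f[0].equals("A")) {
                // A,i,j,value contributes to cells (i,0) .. (i,p-1).
                for (int k = 0; k < p; k++) {
                    grouped.computeIfAbsent(f[1] + "," + k, x -> new ArrayList<>())
                           .add("A," + f[2] + "," + f[3]);
                }
            } else {
                // B,j,k,value contributes to cells (0,k) .. (m-1,k).
                for (int i = 0; i < m; i++) {
                    grouped.computeIfAbsent(i + "," + f[2], x -> new ArrayList<>())
                           .add("B," + f[1] + "," + f[3]);
                }
            }
        }
        return grouped;
    }

    // Mirrors the reducer: for one cell (i,k), sum a_ij * b_jk over j.
    static float reduceCell(List<String> values, int n) {
        Map<Integer, Float> a = new HashMap<>();
        Map<Integer, Float> b = new HashMap<>();
        for (String v : values) {
            String[] f = v.split(",");
            (f[0].equals("A") ? a : b).put(Integer.parseInt(f[1]), Float.parseFloat(f[2]));
        }
        float sum = 0.0f;
        for (int j = 0; j < n; j++) {
            sum += a.getOrDefault(j, 0.0f) * b.getOrDefault(j, 0.0f);
        }
        return sum;
    }

    // Runs both phases and keeps only the non-zero cells, as the reducer does.
    static Map<String, Float> multiply(List<String> lines, int m, int n, int p) {
        Map<String, Float> result = new TreeMap<>();
        for (Map.Entry<String, List<String>> e : mapPhase(lines, m, p).entrySet()) {
            float cell = reduceCell(e.getValue(), n);
            if (cell != 0.0f) {
                result.put(e.getKey(), cell);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // A = [[1,2],[3,4]] and B = [[5,6],[7,8]], one matrix entry per line.
        List<String> lines = Arrays.asList(
            "A,0,0,1", "A,0,1,2", "A,1,0,3", "A,1,1,4",
            "B,0,0,5", "B,0,1,6", "B,1,0,7", "B,1,1,8");
        multiply(lines, 2, 2, 2).forEach((cell, v) -> System.out.println(cell + "," + v));
    }
}
```

Running a small case like this locally first can help separate logic problems from path problems such as the one reported above.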

The script file code

# The Storage account and the HDInsight cluster variables
$subscriptionName = "<AzureSubscriptionName>"
$stringPrefix = "<StringForPrefix>"
$location = "<MicrosoftDataCenter>" ### Must match the data Storage account location
$clusterNodes = <NumberOFNodesInTheCluster>

$storageAccountName_Data = "<TheDataStorageAccountName>"
$containerName_Data = "<TheDataBlobStorageContainerName>"

$clusterName = $stringPrefix + "hdicluster"

$storageAccountName_Default = $stringPrefix + "hdistore"
$containerName_Default = $stringPrefix + "hdicluster"

# The MapReduce job variables
$jarFile = "wasb://$containerName_Data@$storageAccountName_Data.blob.core.windows.net/WordCount/jars/WordCount.jar"
$className = "org.apache.hadoop.examples.WordCount"
$mrInput = "wasb://$containerName_Data@$storageAccountName_Data.blob.core.windows.net/WordCount/Input/"
$mrOutput = "wasb://$containerName_Data@$storageAccountName_Data.blob.core.windows.net/WordCount/Output/"
$mrStatusOutput = "wasb://$containerName_Data@$storageAccountName_Data.blob.core.windows.net/WordCount/MRStatusOutput/"

# Create a PSCredential object. The user name and password are hardcoded here. You can change them if you want.
$password = ConvertTo-SecureString "Pass@word1" -AsPlainText -Force
$creds = New-Object System.Management.Automation.PSCredential ("Admin", $password)

Select-AzureSubscription $subscriptionName

# Create a Storage account used as the default file system
Write-Host "Create a storage account" -ForegroundColor Green
New-AzureStorageAccount -StorageAccountName $storageAccountName_Default -location $location

# Create a Blob storage container used as the default file system
Write-Host "Create a Blob storage container" -ForegroundColor Green
$storageAccountKey_Default = Get-AzureStorageKey $storageAccountName_Default | %{ $_.Primary }
$destContext = New-AzureStorageContext -StorageAccountName $storageAccountName_Default -StorageAccountKey $storageAccountKey_Default

New-AzureStorageContainer -Name $containerName_Default -Context $destContext

# Create an HDInsight cluster
Write-Host "Create an HDInsight cluster" -ForegroundColor Green
$storageAccountKey_Data = Get-AzureStorageKey $storageAccountName_Data | %{ $_.Primary }

$config = New-AzureHDInsightClusterConfig -ClusterSizeInNodes $clusterNodes |
Set-AzureHDInsightDefaultStorage -StorageAccountName "$storageAccountName_Default.blob.core.windows.net" -StorageAccountKey $storageAccountKey_Default -StorageContainerName $containerName_Default |
Add-AzureHDInsightStorage -StorageAccountName "$storageAccountName_Data.blob.core.windows.net" -StorageAccountKey $storageAccountKey_Data

New-AzureHDInsightCluster -Name $clusterName -Location $location -Credential $creds -Config $config

# Create a MapReduce job definition
Write-Host "Create a MapReduce job definition" -ForegroundColor Green
$mrJobDef = New-AzureHDInsightMapReduceJobDefinition -JobName mrWordCountJob -JarFile $jarFile -ClassName $className -Arguments $mrInput, $mrOutput -StatusFolder /WordCountStatus

# Run the MapReduce job
Write-Host "Run the MapReduce job" -ForegroundColor Green
$mrJob = Start-AzureHDInsightJob -Cluster $clusterName -JobDefinition $mrJobDef
Wait-AzureHDInsightJob -Job $mrJob -WaitTimeoutInSeconds 3600

Get-AzureHDInsightJobOutput -Cluster $clusterName -JobId $mrJob.JobId -StandardError
Get-AzureHDInsightJobOutput -Cluster $clusterName -JobId $mrJob.JobId -StandardOutput

# Delete the HDInsight cluster
Write-Host "Delete the HDInsight cluster" -ForegroundColor Green
Remove-AzureHDInsightCluster -Name $clusterName

# Delete the default file system Storage account
Write-Host "Delete the storage account" -ForegroundColor Green
Remove-AzureStorageAccount -StorageAccountName $storageAccountName_Default


As I understand it, you want to compute a matrix multiplication in Azure HDInsight, and your code ran successfully in the HDInsight Emulator but failed in HDInsight on Azure.

When you remote into the cluster, file paths on the HDInsight file system can be written relative to the blob container, which serves as the root, without the host information; for example, wasb:///examples/data/....
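
The two path forms can be compared with a small sketch. The class WasbPaths and its methods are hypothetical helpers, not part of any Azure or Hadoop library, and the container, account, and folder names are placeholders. The full form matches the pattern used in the PowerShell script above; the short form leaves the authority empty so that it is resolved against the cluster's default container.

```java
import java.net.URI;

public class WasbPaths {

    // Full form: names the container and the storage account explicitly.
    static String full(String container, String account, String path) {
        return "wasb://" + container + "@" + account + ".blob.core.windows.net" + normalize(path);
    }

    // Short form: empty authority, so the path is relative to the default container.
    static String relativeToDefault(String path) {
        return "wasb://" + normalize(path);
    }

    // Ensures the path part starts with a single leading slash.
    private static String normalize(String path) {
        return path.startsWith("/") ? path : "/" + path;
    }

    public static void main(String[] args) {
        // Placeholder names, chosen only for illustration.
        String fullForm = full("mycontainer", "myaccount", "Matrix/Input/");
        String shortForm = relativeToDefault("Matrix/Input/");
        System.out.println(fullForm);
        System.out.println(shortForm);
        // Parsing both shows the difference: only the full form carries an authority.
        System.out.println(URI.create(fullForm).getAuthority());
        System.out.println(URI.create(shortForm).getAuthority());
    }
}
```

Printing both forms of the input path before submitting a job is a cheap way to check that the folder name in the path is the one the data was actually uploaded to.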

So you can try remoting into the HDInsight cluster and running the code from there, over SSH for a Linux cluster or from the command prompt for a Windows cluster, following the steps below.

  1. Copy your MapReduce jar file and data file to the HDInsight cluster. For example, for Hadoop on Linux, run scp <your-file> <ssh-username>@<hdcluster-name>-ssh.azurehdinsight.net:/home/<hdcluster-username>/.
  2. Make a directory in the HDInsight file system with hadoop fs -mkdir wasb:///<dir-name>/.
  3. Copy your MapReduce jar file into it with hadoop fs -cp <your jar file> wasb:///<dir-name>/jars/, following the layout of the default examples on HDInsight.

Alternatively, see https://azure.microsoft.com/en-us/documentation/articles/hdinsight-upload-data/ for other ways to upload files to HDInsight instead of the three steps above.

  1. Copy your data file with hadoop fs -cp <your data file> wasb:///<dir-name>/data/input/, again following the layout of the default examples on HDInsight.
  2. Run your code with hadoop jar wasb:///<dir-name>/jars/<your jar file name>.jar <your class name> wasb:///<dir-name>/data/input/<your data file> wasb:///<dir-name>/data/output.
  3. Wait for the job to complete, then run hadoop fs -cat wasb:///<dir-name>/data/output/* to show the result.

If the HDInsight cluster was created on Linux, see https://azure.microsoft.com/en-us/documentation/articles/hdinsight-hadoop-use-mapreduce-ssh/; the SSH login information is shown on the cluster's page in the new Azure portal.

If the HDInsight cluster was created on Windows, see https://azure.microsoft.com/en-us/documentation/articles/hdinsight-hadoop-use-mapreduce-remote-desktop/; the Remote Desktop information is in the same place in the portal, under Remote Desktop instead of Secure Shell.

If you want to see the result of your code, you can also find it in the new Azure portal.