Tsubasa Kato - 1 year ago
PHP Question

Issuing an update to Solr creates duplicates

I just started using Solr 6.0 the other day, and have a PHP script that updates the Solr index using curl.
But right now, updating with the PHP script below creates duplicate entries.

The schema right now is like this: id (unique key field), url, keywords, description, title.

Is this because I did not specify an explicit unique key field on url in the schema?

I would like url to be the unique key, so that Solr does not index duplicates on update and instead overwrites the existing document when the url already exists. How do you do this?

// apt-get install php5 libapache2-mod-php5 php5-curl

// curl 'http://localhost:8983/solr/update/csv?fieldnames=url,keywords,description,title&commit=true' -H 'Content-type:text/plain; charset=utf-8' --data-binary @$file

$SOLR_SERVER = 'localhost'; // assumption: not defined in the original snippet
$CORE = 'core1';
$callback = &$_REQUEST['fd-callback'];
$url = 'http://'. $SOLR_SERVER .':8983/solr/'. $CORE .'/update/csv?fieldnames=url,keywords,description,title&commit=true';

// Read the CSV either from a form upload or from the raw request body.
if (!empty($_FILES['fd-file']) and is_uploaded_file($_FILES['fd-file']['tmp_name'])) {
    $name = $_FILES['fd-file']['name'];
    $data = file_get_contents($_FILES['fd-file']['tmp_name']);
} else {
    $name = urldecode(@$_SERVER['HTTP_X_FILE_NAME']);
    $data = file_get_contents("php://input");
}

// POST the CSV to Solr's CSV update handler.
$header = array("Content-type:text/csv; charset=utf-8");
$post = $data;
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_HTTPHEADER, $header);
curl_setopt($ch, CURLOPT_POST, TRUE);
curl_setopt($ch, CURLOPT_POSTFIELDS, $post);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE); // keep Solr's response out of the page output
//curl_setopt($ch, CURLOPT_VERBOSE, TRUE);
$result = curl_exec($ch); // the request is only sent here
curl_close($ch);
if ($result !== false) {
    $output = 'Upload Success!';
} else {
    $output = 'Upload did not work!';
}

// $opt = &$_REQUEST['upload_option'];
// isset($opt) and $output .= "\nReceived upload_option with value $opt";
if ($callback) {
    header('Content-Type: text/html; charset=utf-8');
    $output = addcslashes($output, "\\\"\0..\x1F");
    echo '<!DOCTYPE html><html><head></head><body><script type="text/javascript">',
} else {
    header('Content-Type: text/plain; charset=utf-8');
    echo $output;
}


Answer

If you want the URL to identify a document uniquely, define the url field as your uniqueKey. If you have an id field defined as your uniqueKey, but submit identical URLs with different ids, Solr cannot know that these documents are the same object.
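
As a rough sketch of what that could look like in the core's schema (schema.xml, or managed-schema in a stock Solr 6 configset): the fragment below assumes a field type named string (solr.StrField) exists, as it does in the default configsets, and the attribute values are illustrative rather than taken from the asker's setup.

```xml
<!-- Assumed excerpt; adjust to the actual schema of the core. -->
<!-- The uniqueKey field should be single-valued and untokenized. -->
<field name="url" type="string" indexed="true" stored="true" required="true" multiValued="false"/>

<!-- Replaces the existing <uniqueKey>id</uniqueKey> entry. -->
<uniqueKey>url</uniqueKey>
```

After a change like this the core has to be reloaded and the existing documents reindexed. If the id field is still declared with required="true", that attribute would also need to go, since the CSV upload does not supply an id. With url as the uniqueKey, posting a row whose url is already in the index replaces the earlier document, because updates overwrite by uniqueKey by default.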

Another option is to keep id as the uniqueKey but derive it from the URL, so that each id references exactly one URL: either a hash of the URL or the URL's key in your DB.
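
The hash variant could be sketched in PHP as follows. The helper name add_url_hash_ids is hypothetical and not part of the original script; it assumes one CSV record per line (no line breaks inside quoted fields) and the url in the first column, as in the fieldnames list from the question.

```php
<?php
// Hypothetical helper: prepend an "id" column holding the SHA-1 of each
// row's url, so that the same url always yields the same uniqueKey value.
function add_url_hash_ids($csv)
{
    $out = fopen('php://temp', 'r+');
    foreach (preg_split('/\R/', trim($csv)) as $line) {
        if ($line === '') {
            continue; // skip blank lines
        }
        // Assumed column order: url, keywords, description, title
        $row = str_getcsv($line, ',', '"', '\\');
        array_unshift($row, sha1($row[0])); // id = 40-character hex digest of the url
        fputcsv($out, $row, ',', '"', '\\');
    }
    rewind($out);
    $result = stream_get_contents($out);
    fclose($out);
    return $result;
}
```

In the upload script this would be applied as `$data = add_url_hash_ids($data);` before the curl call, with the update URL changed to `fieldnames=id,url,keywords,description,title`. Note that the hash is taken over the exact string, so two spellings of the same address (for example with and without a trailing slash) still produce different ids unless the URLs are normalized first.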