Samyak Jain - 28 days ago
Python Question

Flickr API returns duplicate photos while extracting all geotagged photos

I'm trying to extract all geotagged photos from Flickr using the Flickr API method flickr.photos.search(). Here is the code:

import flickr_api
from flickr_api.api import flickr

flickr_api.set_keys(api_key='my_api_key', api_secret='my_api_secret')
flickr_api.set_auth_handler("AuthToken")

for i in range(1, 1700):
    # fetch one page of geotagged photos (250 per page)
    photo_list = flickr.photos.search(
        api_key='my_api_key', has_geo=1,
        extras='description,license,geo,tags,machine_tags',
        per_page=250, page=i,
        min_upload_date='972518400', accuracy=12)
    # write each page of results to its own XML file
    f = open('xmldata1/photodata' + str(i) + '.xml', 'w')
    f.write(photo_list)
    f.close()


This script produces one XML file per page of results, each containing data for 250 photos. There are 1699 such XML files, holding roughly 420,000 photo records in total, with a lot of duplicates. After removing the duplicates, I was left with only 9022 unique images.
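For reference, the deduplication step can be done by collecting the distinct photo ids across all saved pages; a minimal sketch (the file pattern and helper name here are illustrative, not from the original script):

```python
import glob
import xml.etree.ElementTree as ET

def unique_photo_ids(pattern='xmldata1/photodata*.xml'):
    """Collect the set of distinct photo ids across all saved result pages."""
    ids = set()
    for path in glob.glob(pattern):
        root = ET.parse(path).getroot()
        # Flickr's photos.search response nests <photo id="..."> elements
        for photo in root.iter('photo'):
            ids.add(photo.get('id'))
    return ids
```

Counting len(unique_photo_ids()) is how a figure like 9022 unique images out of ~420,000 records would be obtained.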

I have read that it is safe to query for up to 16 pages (4,000 images) at once to avoid duplicates.

I want to avoid duplicate images as much as possible, and I need 100,000+ unique geotagged images for GPS clustering.

What time lag should I insert between two consecutive queries?
If I should consider a different approach, please elaborate on it.

Let me know if you have any queries. Any help would be appreciated!

Answer

Try using a max_upload_date along with the min_upload_date. Keep the time frame to a couple of days, and slide it forward from your min_upload_date to the present, searching only for photos uploaded within the current window. This keeps each query's result set small enough that pagination stays well under the 4,000-image threshold mentioned above, so the pages you fetch don't overlap.
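A minimal sketch of this sliding-window approach. The window helper below is hypothetical, and the commented search call assumes the same flickr_api wrapper used in the question:

```python
import time

def date_windows(start, end, step_days=2):
    """Yield (min_upload_date, max_upload_date) epoch pairs that cover
    [start, end) in consecutive, non-overlapping steps of step_days."""
    step = step_days * 86400  # seconds in step_days
    t = start
    while t < end:
        yield (t, min(t + step, end))
        t += step

# Hypothetical usage with the wrapper from the question: page only
# within each window, so no single query approaches the 4,000 limit.
#
# for lo, hi in date_windows(972518400, int(time.time())):
#     page = flickr.photos.search(api_key='my_api_key', has_geo=1,
#                                 min_upload_date=str(lo),
#                                 max_upload_date=str(hi),
#                                 per_page=250, page=1, accuracy=12)
```

If a two-day window still returns more than 16 pages in a busy period, shrink step_days for that stretch of dates.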
