Mincong Huang Mincong Huang - 7 months ago 14
SQL Question

How to import aws public dataset into mysql instance?

I'm trying to use a public dataset in Amazon AWS, called Twilio/Wigle.net Street Vector Data Set. This dataset contains data of US street names and address ranges. Its size is about 10 GiB. When we take a look in linux, it looks like the following :

ubuntu@ip-172-31-xxx-xxx:/data-us-street$ pwd
/data-us-street
ubuntu@ip-172-31-xxx-xxx:/data-us-street$ ll
total 20576
drwxr-xr-x 6 27 sudo 4096 May 19 2009 ./
drwxr-xr-x 23 root root 4096 Apr 20 18:10 ../
-rw-r--r-- 1 root root 8339 May 19 2009 README
drwx------ 2 27 sudo 4096 Mar 18 2009 addresses/
-rw-rw---- 1 27 sudo 5242880 Mar 18 2009 ib_logfile0
-rw-rw---- 1 27 sudo 5242880 Mar 8 2009 ib_logfile1
-rw-rw---- 1 27 sudo 10485760 Mar 18 2009 ibdata1
drwx------ 2 27 sudo 16384 Mar 8 2009 lost+found/
drwx------ 2 27 sudo 4096 Mar 8 2009 mysql/
-rw-rw---- 1 27 sudo 117 Mar 18 2009 mysql-bin.000001
-rw-rw---- 1 27 sudo 19 Mar 18 2009 mysql-bin.index
drwx------ 2 27 sudo 4096 Mar 8 2009 test/


In order to use its data, I want to link it to a MySQL database in the same host. Can somebody tell me how to do it ?




What I've tried



I've tried to overwrite the mysql storage directory
datadir
located at
/etc/mysql/my.cnf
with the following change :

#datadir = /var/lib/mysql
datadir = /data-us-street


I stopped the server, changed to value then restarted the MySQL server. However, it doesn't work.




README

/*
Copyright (c) 2009 Twilio, Inc.

Permission is hereby granted, free of charge, to any person
obtaining a copy of this software and associated documentation
files (the "Software"), to deal in the Software without
restriction, including without limitation the rights to use,
copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the
Software is furnished to do so, subject to the following
conditions:

The above copyright notice and this permission notice shall be
included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES
OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT
HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY,
WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR
OTHER DEALINGS IN THE SOFTWARE.
*/

1. The Twilio/Wigle.net public address dataset

This public dataset contains the street data for the U.S., based on
the geodata published by the U.S. Census Bureau's TIGER project.
We have reformated the data from GIS friendly shapefiles to a more
generally accessible MySQL database.

This dataset covers all the streets, roads, and highways in the U.S.
These streets are represented as "polylines" which are shapes made up
of individual line segments. Each polyline has its own unique ID,
and each segment that makes up the polyline has a sequence number.
The combination of this ID and sequence number is the primary key of
the address table. Each row in the database represents one of these
line segments.

2. Description of Columns

+-------------+------------+------+-----+------------+-------+
| Field | Type | Null | Key | Default | Extra |
+-------------+------------+------+-----+------------+-------+
| id | char(10) | NO | PRI | | |
| seq | char(3) | NO | PRI | | |
| name | char(30) | YES | MUL | NULL | |
| prefix | char(2) | YES | MUL | NULL | |
| type | char(4) | YES | | NULL | |
| startlat | float(12,8)| NO | | 0.00000000 | |
| startlong | float(12,8)| NO | | 0.00000000 | |
| endlat | float(12,8)| NO | | 0.00000000 | |
| endlong | float(12,8)| NO | | 0.00000000 | |
| leftzip | int(5) | YES | MUL | NULL | |
| rightzip | int(5) | YES | MUL | NULL | |
| leftaddr1 | char(11) | YES | | NULL | |
| leftaddr2 | char(11) | YES | | NULL | |
| rightaddr1 | char(11) | YES | | NULL | |
| rightaddr2 | char(11) | YES | | NULL | |
| name_dtmf | char(30) | YES | | NULL | |
| prefix_dtmf | char(2) | YES | | NULL | |
+-------------+------------+------+-----+------------+-------+

id - this is the unique ID of a street, road, or highway polyline,
according to US census blocks

seq - this is the sequence number of this segment in the polyline,
used to correlate 2 segments of the same street within a US
census block

name - the name of this street, road, or highway. example: Main

prefix - the prefix of the street name. examples: N,S,E,W

type - the suffix of the street name. examples: Blvd, St, Rd

startlat/startlong - the lat/long pair of this segments the starting
point

endlat/endlong - the lat/long pair of this segments the ending point

leftzip/rightzip - the zipcode of the addresses on the corresponding
side of the street

leftaddr1/leftaddr2/rightaddr1/rightaddr2 - the starting and ending
addresses
numbers for each side of the street

name_dtmf/prefix_dtmf/type_dtmf - the name, prefix and type columns,
represented
as DTMF encoded numbers. For example, (A,B,C) = 2, (D,E,F) = 3,
etc.
Useful for telephony applications.


3. Example Segment

--------------------------------------------------------
id: 111710515
seq: 0
name: Main
prefix:
type: St
startlat: 41.49493408
startlong: -87.70324707
endlat: 41.49483490
endlong: -87.70324707
leftzip: 60466
rightzip: 60443
leftaddr1: 21801
leftaddr2: 21805
rightaddr1: 21800
rightaddr2: 21804
name_dtmf: 6246
prefix_dtmf:
--------------------------------------------------------

Name: Main St.

Left Zip: 60466
21801 21805
X------------------------------------>
21800 21804
Right Zip: 60443




4. Common Queries

a. Find the street segment record for 30 Rockefeller Plaza, NY, NY,
10020

SELECT * FROM address WHERE leftzip=10020 AND name LIKE 'Rockefeller'
AND 30 BETWEEN leftaddr1 AND leftaddr2
UNION
SELECT * FROM address WHERE rightzip=10020 AND name LIKE
'Rockefeller' AND 30 BETWEEN rightaddr1 AND rightaddr2

--------------------------------------------------------
id: 59657155
seq: 0
name: Rockefeller
prefix:
type: Plz
startlat: 40.7585
startlong: -73.979
endlat: 40.7592
endlong: -73.9786
leftzip: 10020
rightzip: 10020
leftaddr1: 22
leftaddr2: 38
rightaddr1: 21
rightaddr2: 39
name_dtmf: 7625335537
prefix_dtmf:
--------------------------------------------------------


b. Find all the street names within a zipcode.

SELECT DISTINCT prefix, name, type FROM address WHERE leftzip=10009
OR rightzip = 10009

+--------+----------+------+
| prefix | name | type |
+--------+----------+------+
| | 1st | Ave |
| | Avenue A | |
| E | 13th | St |
| E | 14th | St |
| E | 12th | St |
| E | 11th | St |
| E | 10th | St |
| E | 2nd | St |
| E | 3rd | St |
| E | 4th | St |
| E | 6th | St |
| E | 7th | St |
| | St Marks | Pl |
| E | 9th | St |
| | Avenue B | |
| E | 5th | St |
| | Avenue C | |
| E | 20th | St |
| E | 15th | St |
| E | 16th | St |
| E | 8th | St |
| | Avenue D | |
| | Szold | Pl |
+--------+----------+------+

c. Streets starting with F in a zipcode

SELECT DISTINCT prefix, name, type from address WHERE leftzip = 94117
AND name LIKE 'F%'
UNION
SELECT DISTINCT prefix, name, type from address WHERE rightzip =
94117 AND name LIKE 'F%'

+--------+------------+------+
| prefix | name | type |
+--------+------------+------+
| | Fulton | St |
| | Fell | St |
| | Frederick | St |
| | Farnsworth | Ln |
| | Fillmore | St |
| | Friendship | Ct |
+--------+------------+------+

d. Streets starting with "FRE" as DTMF digits in a zipcode:

SELECT DISTINCT prefix, name, type from address WHERE leftzip = 94117
AND name_dtmf LIKE '373%'
UNION
SELECT DISTINCT prefix, name, type from address WHERE rightzip =
94117 AND name_dtmf LIKE '373%'

+--------+------------+------+
| prefix | name | type |
+--------+------------+------+
| | Fulton | St |
| | Fell | St |
| | Frederick | St |
| | Farnsworth | Ln |
| | Fillmore | St |
| | Friendship | Ct |
+--------+------------+------+

Answer

Every sub-folder under the root folder /data-us-street represents a database. So if I want to use database addresses, I need to use folder ./addresses:

  1. Copy the target database folder into the data directory of the current MySQL Database Server instance, e.g. /var/lib/mysql

    $ cp -r /data-us-street/addresses /var/lib/mysql/addresses
    
  2. Change the folder owner recursively

    $ chown mysql:mysql /var/lib/mysql/addresses/*
    
  3. Use database addresses and enjoy !

Comments