I'm currently learning TensorFlow, and as a first project (after following the MNIST tutorials) I would like to create a model (probably an RNN) to do some basic string formatting.
I know I may not need something as complex as deep learning for the following case, but it's just for training myself.
I have a set of supposedly "clean address" strings from which I want to extract the actual clean address.
Here is the kind of transformation I want to get:
RUE DE MADAGASCAR --> RUE DE MADAGASCAR
ZI DE LA PLAINE
55 RUE DU 1ER SEPTEMBRE 1944 --> 55 RUE DU 1ER SEPTEMBRE 1944
ZONE INDUSTRIELLE RUE DE LA VALLEE B.P. 8 --> RUE DE LA VALLEE
BP 62 AVENUE BECQUEREL --> AVENUE BECQUEREL
291 VOIE ATLAS --> 291 VOIE ATLAS
12 RUE ARMAND BUSQUET ZONE INDUSTRIELLE --> 12 RUE ARMAND BUSQUET
DOSSIER MLOC 5 RUE AMABLE LOZAI --> 5 RUE AMABLE LOZAI
ZI CAEN CANAL -->
RUE DE L'EUROPE ZI PORTUAIRE --> RUE DE L'EUROPE
BP 5229 BOULEVARD HENRY BECQUEREL CAMPUS JULES HOROWITZ --> BOULEVARD HENRY BECQUEREL
GIE MONSIEUR GAUTIER BOULEVARD H. BECQUEREL BP 5027 --> BOULEVARD H. BECQUEREL
21 PLACE DE LA REPUBLIQUE --> 21 PLACE DE LA REPUBLIQUE
18 RUE DE LA GIRAFE --> 18 RUE DE LA GIRAFE
21 RUE DES GOUDRIERS --> 21 RUE DES GOUDRIERS
AVENUE STRASSBURGER --> AVENUE STRASSBURGER
7 RUE DE L'EGLISE --> 7 RUE DE L'EGLISE
1060 RUE LEON FOUCAULT ZI DE LA SPHERE --> 1060 RUE LEON FOUCAULT
If you need more examples, here is a link to a spreadsheet with 200 elements (I plan to expand it to 1000-5000 elements).
As you can see, there are a lot of recognizable patterns.
How I plan to proceed
I'm trying to get an output string which is a part of the input string. The model should remove words based on the patterns described above.
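Because the output is always a subset of the input words, each training pair can be turned into one keep/drop label per input word. The following is only a minimal sketch of that idea; the function name `keep_labels` is hypothetical, and it assumes that words are separated by whitespace and that the clean address reuses the words of the raw one unchanged.

```python
from difflib import SequenceMatcher

def keep_labels(raw, clean):
    """Return one 0/1 label per word of `raw`: 1 if the word is kept in `clean`."""
    raw_words, clean_words = raw.split(), clean.split()
    labels = [0] * len(raw_words)
    # Align the two word lists and mark every matched word as "kept".
    matcher = SequenceMatcher(None, raw_words, clean_words, autojunk=False)
    for start, _, size in matcher.get_matching_blocks():
        for i in range(start, start + size):
            labels[i] = 1
    # If some clean word could not be aligned, the pair needs a manual look.
    if sum(labels) != len(clean_words):
        raise ValueError("could not align the clean address with the raw one")
    return labels

print(keep_labels("ZONE INDUSTRIELLE RUE DE LA VALLEE B.P. 8", "RUE DE LA VALLEE"))
# prints [0, 0, 1, 1, 1, 1, 0, 0]
```

With labels like these, the problem becomes a per-word binary classification over a sequence rather than free-form string generation.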
I think I will go with an RNN-type graph, since it should detect things like: "there is a 'BP', so I'm not taking this word, and if the next input is a 2- or 4-digit string I'm not taking that either". I think there should be some kind of memory.
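To make the "memory" idea concrete, here is the BP rule written by hand, with a single flag standing in for the state an RNN would have to carry from one word to the next. This is an illustration of the rule only, not a proposed model. The name `drop_bp` is hypothetical, and the sketch assumes that any all-digit word right after "BP" or "B.P." is a box number (the examples above include 1-, 2- and 4-digit numbers).

```python
def drop_bp(words):
    """Drop 'BP' / 'B.P.' and the number that immediately follows it."""
    kept = []
    after_bp = False  # the "memory": was the previous word a BP marker?
    for word in words:
        if word in ("BP", "B.P."):
            after_bp = True
            continue
        if after_bp and word.isdigit():
            after_bp = False
            continue
        after_bp = False
        kept.append(word)
    return kept

print(" ".join(drop_bp("BP 62 AVENUE BECQUEREL".split())))
# prints AVENUE BECQUEREL
```

Numbers that do not follow a BP marker, such as the street number in "291 VOIE ATLAS", are left alone, which is exactly the context-dependence a memoryless per-word rule could not express.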
It all depends on the way I want to input my data. I think I have two or three ways of doing that:
- Input single words (split by space)
- Input entire String (entire address)
- Input a string, then split it on a deeper layer?
The thing is:
- If I input single words, how do I mark the string separation?
- If I input the entire string, it seems like a bit of a waste, since the system is only going to keep or remove single words.
- Does the third option (mixing the two) even make sense?
Is it possible to train in batches and use the "batch part" to input multiple words, with every batch representing an address?
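For reference, one common arrangement for word-level sequence input, sketched here under assumptions rather than as a definitive answer, is to make each row of a batch one address: words become integer ids, shorter addresses are padded, and a mask records which positions are real words. The names `build_vocab`, `encode_batch`, `<PAD>` and `<UNK>` are hypothetical.

```python
PAD, UNK = "<PAD>", "<UNK>"  # hypothetical special tokens

def build_vocab(addresses):
    """Map every distinct word to an integer id; ids 0 and 1 are reserved."""
    vocab = {PAD: 0, UNK: 1}
    for address in addresses:
        for word in address.split():
            vocab.setdefault(word, len(vocab))
    return vocab

def encode_batch(addresses, vocab):
    """Return (ids, mask) with one row per address, padded to the longest one."""
    rows = [[vocab.get(w, vocab[UNK]) for w in a.split()] for a in addresses]
    max_len = max(len(row) for row in rows)
    ids = [row + [vocab[PAD]] * (max_len - len(row)) for row in rows]
    mask = [[1] * len(row) + [0] * (max_len - len(row)) for row in rows]
    return ids, mask

batch = ["BP 62 AVENUE BECQUEREL", "291 VOIE ATLAS"]
vocab = build_vocab(batch)
ids, mask = encode_batch(batch, vocab)
print(ids)   # prints [[2, 3, 4, 5], [6, 7, 8, 0]]
print(mask)  # prints [[1, 1, 1, 1], [1, 1, 1, 0]]
```

In this layout the row boundary marks where one address ends and the next begins, so no explicit separator word is needed, and the mask lets the padded positions be ignored in the loss.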
Also, I wonder whether the weights of the nodes in my system are going to be all 0s and 1s (since it can only keep or remove single words), or whether they are going to be intermediate values, like a probability of keeping the word.
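As a rough illustration of the second possibility: in a typical setup the weights are real-valued, the network emits one real-valued score per word, a sigmoid turns that score into a keep probability, and the hard 0/1 decision only appears when the probability is thresholded. The scores below are made up for the example, and `select_words` is a hypothetical helper.

```python
import math

def keep_probabilities(scores):
    """Squash raw per-word scores into (0, 1) with a sigmoid."""
    return [1.0 / (1.0 + math.exp(-s)) for s in scores]

def select_words(words, scores, threshold=0.5):
    """Keep the words whose keep probability reaches the threshold."""
    probs = keep_probabilities(scores)
    return [w for w, p in zip(words, probs) if p >= threshold]

words = "BP 62 AVENUE BECQUEREL".split()
scores = [-3.0, -2.0, 2.5, 4.0]  # made-up model outputs, one per word
print(" ".join(select_words(words, scores)))
# prints AVENUE BECQUEREL
```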
Recap of the process
- Create a dictionary of all single Words
- Pad my Strings to the same length?
- Convert all my strings (or word?) into a 1D array
- Define the graph
- Input String (or word?) by small batches
- Test and display accuracy (should the output string be an exact match of the expected output, or is a % of difference between the expected output and the actual output more interesting?)
- Save the graph
- Use it to format my Strings
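For the accuracy step in the recap, the two options can be computed side by side. This is a small sketch with hypothetical function names: `exact_match_rate` counts an address as correct only if the whole output string matches, while `word_accuracy` counts individual keep/drop decisions and assumes the predictions are available as 0/1 labels per word.

```python
def exact_match_rate(predicted, expected):
    """Fraction of addresses whose output string matches exactly."""
    hits = sum(p == e for p, e in zip(predicted, expected))
    return hits / len(expected)

def word_accuracy(predicted_labels, true_labels):
    """Fraction of individual keep/drop decisions that are correct."""
    correct = total = 0
    for pred_row, true_row in zip(predicted_labels, true_labels):
        for p, t in zip(pred_row, true_row):
            correct += int(p == t)
            total += 1
    return correct / total

predicted = ["AVENUE BECQUEREL", "291 VOIE"]
expected = ["AVENUE BECQUEREL", "291 VOIE ATLAS"]
print(exact_match_rate(predicted, expected))  # prints 0.5

predicted_labels = [[0, 0, 1, 1], [1, 1, 0]]
true_labels = [[0, 0, 1, 1], [1, 1, 1]]
print(word_accuracy(predicted_labels, true_labels))  # 6 of 7 decisions correct
```

The per-word figure gives a smoother signal while training, whereas the exact-match figure reflects how many addresses would be usable without correction.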
Thanks a lot for reading through all of that; any help would be appreciated,
especially regarding the general direction I'm heading in and the way of inputting my data into the graph.