m.nachury - 2 months ago
Python Question

Need advice on RNN model to format Strings

The situation



I'm currently learning TensorFlow, and as a first project (after working through the MNIST tutorials) I would like to create a model (probably an RNN) to do some basic string formatting.

I know I may not need something as complex as deep learning for the following case, but it's just to train myself.

I have a set of supposedly "clean address" strings from which I want to extract the actual clean address.

Here is the kind of transformation I want to get:

RUE DE MADAGASCAR --> RUE DE MADAGASCAR
ZI DE LA PLAINE
55 RUE DU 1ER SEPTEMBRE 1944 --> 55 RUE DU 1ER SEPTEMBRE 1944
ZONE INDUSTRIELLE RUE DE LA VALLEE B.P. 8 --> RUE DE LA VALLEE
BP 62 AVENUE BECQUEREL --> AVENUE BECQUEREL
291 VOIE ATLAS --> 291 VOIE ATLAS
12 RUE ARMAND BUSQUET ZONE INDUSTRIELLE --> 12 RUE ARMAND BUSQUET
DOSSIER MLOC 5 RUE AMABLE LOZAI --> 5 RUE AMABLE LOZAI
ZI CAEN CANAL -->
RUE DE L'EUROPE ZI PORTUAIRE --> RUE DE L'EUROPE
BP 5229 BOULEVARD HENRY BECQUEREL CAMPUS JULES HOROWITZ --> BOULEVARD HENRY BECQUEREL
GIE MONSIEUR GAUTIER BOULEVARD H. BECQUEREL BP 5027 --> BOULEVARD H. BECQUEREL
21 PLACE DE LA REPUBLIQUE --> 21 PLACE DE LA REPUBLIQUE
18 RUE DE LA GIRAFE --> 18 RUE DE LA GIRAFE
21 RUE DES GOUDRIERS --> 21 RUE DES GOUDRIERS
AVENUE STRASSBURGER --> AVENUE STRASSBURGER
7 RUE DE L'EGLISE --> 7 RUE DE L'EGLISE
1060 RUE LEON FOUCAULT ZI DE LA SPHERE --> 1060 RUE LEON FOUCAULT


If you need more examples, here is a link to a spreadsheet with 200 elements (I'm planning to expand it to 1000 - 5000 elements).

As you can see, there are a lot of recognizable patterns:


  • Don't take BP words and the 2 or 4 digits that come after

  • Don't take ZI, ZA or Zone d'activité ...

  • Addresses usually look like 00 (Rue|Voie|Avenue|...) nameOfStreet

  • etc...
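For reference, the patterns listed above can be written down directly as regular expressions, which gives a rule-based baseline to compare a learned model against. This is only a rough sketch: the keyword lists and the `extract_street` helper are assumptions made for illustration, not something taken from the spreadsheet, and they will miss cases such as trailing building or campus names.

```python
import re

# Hypothetical patterns written from the bullet points above. The keyword
# lists are assumptions and would need tuning against the real data.
BP_PATTERN = re.compile(r"\bB\.?P\.?\s*\d+\b")
STREET_PATTERN = re.compile(
    r"\b(?:\d+\s+)?(?:RUE|VOIE|AVENUE|BOULEVARD|PLACE)\b.*"
)
ZONE_PATTERN = re.compile(r"\s+(?:ZI|ZA|ZONE)\b.*$")


def extract_street(raw_address):
    """Return the street part of raw_address, or "" if none is recognized."""
    # Drop "BP 62", "B.P. 8" and similar post-box mentions.
    without_bp = BP_PATTERN.sub(" ", raw_address)
    # Keep everything from an optional house number plus a street keyword.
    match = STREET_PATTERN.search(without_bp)
    if match is None:
        return ""
    # Cut a trailing zone mention such as "ZI PORTUAIRE".
    street = ZONE_PATTERN.sub("", match.group(0))
    # Normalize whitespace left behind by the removals.
    return " ".join(street.split())


if __name__ == "__main__":
    for raw in ["BP 62 AVENUE BECQUEREL", "ZI CAEN CANAL"]:
        print(repr(raw), "->", repr(extract_street(raw)))
```

Rules like these break down quickly on the messier rows, which is where a learned model might earn its keep.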



How I think to proceed



I'm trying to get an output string that is a part of the input string. The model should remove words based on the patterns described above.

I think I will go with an RNN type of graph, since it should detect things like "there is a "BP", so I'm not taking this word, and if the next input is a 2 or 4 digit string I'm not taking it either". I think there should be some kind of memory.

It all depends on the way I want to input my data. I think I have two or three ways of doing that:


  • Input single words (split by space)

  • Input entire String (entire address)

  • Input a string, then split it on a deeper layer?



The thing is:


  • If I input single words, how do I mark the string separation?

  • If I input the entire string, it seems a bit of a waste since the system is only going to keep or remove single words.

  • Does the third option (mixing the two) even make sense?



Is it possible to train in batches and use the "batch part" to input multiple words, so that every batch represents an address?

Also, I wonder if in my system the weights of the nodes are going to be all 0 and 1 (since it can only keep or remove single words), or if they are going to be intermediate values like a probability of keeping the word.

Recap of the process




  1. Create a dictionary of all single Words

  2. Pad my Strings to the same length?

  3. Convert all my strings (or word?) into a 1D array

  4. Define the graph

  5. Input String (or word?) by small batches

  6. Test and display accuracy (should the output string be an exact match of the expected output, or is a % of difference between the expected output and the actual output more interesting?)

  7. Save the graph

  8. Use it to format my Strings
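Steps 1 to 3 and the accuracy question in step 6 can be sketched in plain Python as follows. The helper names, the `<pad>` and `<unk>` tokens and the fixed `max_len` are assumptions made for illustration, not something prescribed by TensorFlow. Two metrics are shown because both are plausible answers to step 6: the share of addresses reproduced exactly, and the share of individual keep/drop decisions that are right.

```python
PAD, UNK = "<pad>", "<unk>"  # hypothetical special tokens


def build_vocabulary(addresses):
    """Step 1: map every distinct word to an integer id."""
    vocabulary = {PAD: 0, UNK: 1}
    for address in addresses:
        for word in address.split():
            vocabulary.setdefault(word, len(vocabulary))
    return vocabulary


def encode(address, vocabulary, max_len):
    """Steps 2 and 3: turn an address into a fixed-length list of ids."""
    ids = [vocabulary.get(word, vocabulary[UNK]) for word in address.split()]
    ids = ids[:max_len]  # truncate addresses that are too long
    return ids + [vocabulary[PAD]] * (max_len - len(ids))  # pad short ones


def exact_match_rate(predicted, expected):
    """Step 6, strict version: share of outputs that match exactly."""
    matches = sum(p == e for p, e in zip(predicted, expected))
    return matches / len(expected)


def word_accuracy(predicted_labels, expected_labels):
    """Step 6, lenient version: share of correct per-word decisions."""
    pairs = [
        (p, e)
        for predicted, expected in zip(predicted_labels, expected_labels)
        for p, e in zip(predicted, expected)
    ]
    return sum(p == e for p, e in pairs) / len(pairs)


if __name__ == "__main__":
    vocab = build_vocabulary(["BP 62 AVENUE BECQUEREL", "291 VOIE ATLAS"])
    print(encode("291 VOIE ATLAS", vocab, max_len=5))
```

The per-word figure is usually more informative while the model is still learning, whereas the exact-match figure reflects what the end user would actually see.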



Thanks a lot for reading through all that. Any help would be appreciated, especially regarding the general direction I'm heading in and the way of inputting my data to the graph.

Answer Source

There are two ways of approaching the problem that immediately come to mind:

  • Sequence tagging - Label each word in the input with a 1 or a 0 indicating whether or not it should be kept.
  • seq2seq model - Let the RNN read the whole input and then produce an output word-by-word or character-by-character.
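For the sequence tagging option, the 1/0 labels do not have to be written by hand, since each expected output in the question is a subsequence of the input words. A minimal sketch, assuming whitespace tokenization and a greedy left-to-right alignment (the function names are hypothetical, and the greedy alignment may pick the wrong occurrence when a kept word also appears earlier in a dropped part):

```python
def keep_labels(raw_address, clean_address):
    """Label each word of raw_address with 1 (keep) or 0 (drop)."""
    raw_words = raw_address.split()
    clean_words = clean_address.split()
    labels = []
    next_clean = 0  # index of the next clean word we still need to find
    for word in raw_words:
        if next_clean < len(clean_words) and word == clean_words[next_clean]:
            labels.append(1)
            next_clean += 1
        else:
            labels.append(0)
    if next_clean != len(clean_words):
        raise ValueError("clean address is not a subsequence of the raw address")
    return labels


def apply_labels(raw_address, labels):
    """Rebuild the output string from per-word keep/drop decisions."""
    kept = [word for word, label in zip(raw_address.split(), labels) if label == 1]
    return " ".join(kept)


if __name__ == "__main__":
    raw = "12 RUE ARMAND BUSQUET ZONE INDUSTRIELLE"
    labels = keep_labels(raw, "12 RUE ARMAND BUSQUET")
    print(labels, "->", apply_labels(raw, labels))
```

At prediction time the same `apply_labels` step turns the tagger's decisions back into a formatted address.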

If you're just starting, I would recommend the sequence tagging model. If you want to do this, the steps I would follow are:

  1. Represent the input as a sequence of one-hot vectors (each dimension represents a word)
  2. Represent the labels as a sequence of 1's and 0's (indicating if each word should be kept or not)
  3. Use an RNN to read each sequence
  4. Use a 2-node layer to output a score for class 1 and class 0 for each word
  5. Use an optimizer to minimize the difference between predicted and actual labels

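To make steps 1 to 4 concrete, here is a rough forward pass of a tiny Elman-style RNN tagger in plain Python, together with the cross-entropy loss that the optimizer in step 5 would minimize. It is a sketch under stated assumptions rather than a TensorFlow implementation: the weights are random and untrained, the sizes and word ids are made up, and in practice a framework would compute the gradients and do the updates.

```python
import math
import random


def one_hot(index, size):
    """Step 1: a vector with a single 1 at the word's index."""
    vector = [0.0] * size
    vector[index] = 1.0
    return vector


def matvec(matrix, vector):
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def tag_sequence(word_ids, params):
    """Steps 3 and 4: return [P(drop), P(keep)] for every word."""
    w_in, w_rec, w_out = params
    hidden = [0.0] * len(w_rec)  # the "memory" carried from word to word
    probabilities = []
    for word_id in word_ids:
        x = one_hot(word_id, len(w_in[0]))
        summed = [a + b for a, b in zip(matvec(w_in, x), matvec(w_rec, hidden))]
        hidden = [math.tanh(v) for v in summed]
        probabilities.append(softmax(matvec(w_out, hidden)))  # 2-node layer
    return probabilities


def cross_entropy(probabilities, labels):
    """Step 5: the quantity an optimizer would minimize."""
    return -sum(math.log(p[y]) for p, y in zip(probabilities, labels)) / len(labels)


rng = random.Random(0)
vocab_size, hidden_size = 5, 4  # made-up sizes for the example
params = (
    random_matrix(hidden_size, vocab_size, rng),   # input -> hidden
    random_matrix(hidden_size, hidden_size, rng),  # hidden -> hidden
    random_matrix(2, hidden_size, rng),            # hidden -> 2 classes
)
word_ids = [0, 1, 2, 3]  # e.g. BP 62 AVENUE BECQUEREL
labels = [0, 0, 1, 1]    # step 2: drop, drop, keep, keep
probabilities = tag_sequence(word_ids, params)
loss = cross_entropy(probabilities, labels)
```

Note that the model outputs a probability of keeping each word, not a hard 0 or 1; the final decision comes from picking the more likely class for each word.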
For an example of how to do sequence tagging in TensorFlow, take a look at: https://github.com/guillaumegenthial/sequence_tagging