Amrith Krishna Amrith Krishna - 3 months ago 8x
Python Question

Efficient data structure for storing data with relative ordering

I have to store a sentence along with the possible semgments of the sentence into an efficient data structure. Currently I use a dictionary followed by a list for each key of the dictionary to store the segments. Can I use a better data structure to store the same efficiently. I have detailed the whole requirements below.

Input sentence with possible candidate segments

Here, the sentence starts with

, the one without any background colour. Each of the colored boxes numbered 1 to 24 are segments of the sentence.

Now currently I store the following as follows

class sentence:
sentence = "pravaramuku....."
segments = dict()

The keys are starting position of the box relative to the sentence and values are objects storing details of each of the box.

segments = {0: [pravara_box1, pravara_box10],

Two boxes are said to be conflicting, if the
of one of the boxes is in between the
key+len(word in box)
of the other box (the range is inclusive). For example, Box 7 and Box 15 are conflicting and so is boxes 3 and 11.

In the program, one of the boxes will be selected as winner which is decided by a magic method. Once a winner is selected, its conflicting boxes are removed. Again another box is selected and this iteratively continues till no boxes remain.

Now, Currently my data-structure as you can see is a dictionary with each key has a list as its value.

What would be a better data structure to handle this, as currently the eliminating conflicting nodes portion is taking a lot of time.

My requirements can be summarized as follows:

  • What can be an efficient data structure for the following data to be stored so as to have faster processing.

  • The relative position of each box needs to be stored. Is there a better way to explicitly mark the conflicting nodes(may be with something like pointers in C)

  • This is a tree, but there is no sequential order traversal, as random access of box is required i.e any box needs to be called (with O(1)) rather than traversing from one to other.

  • The creation of data-structure is a single time operation, and hence the whole insertion process can be time taking, but accessing the boxes and eliminating the conflicting nodes needs to be done repetitively and hence requires speed up there.

Any help that can partially solve my problems are appreciated.


It seems like you could get away with a backtracking depth-first-search on a properly constructed tree:

sentence = "pravaramuku.........yugalah"
words = sentenceToWords(sentence)  # it seems like you already have this

tree = collections.defauldict(list)
for word in words:
    for i in (i for i in range(len(sentence)) if sentence[i:i+len(word)] == word):

Once that's done, you just need a depth first traversal of your tree:

def makeSentences(tree, pos=None, sofar=None):
    if pos is None: pos = 0
    if sofar is None: sofar = []
    if pos not in tree: print(' '.join(sofar))
    for word in tree[pos]:
        makeSentences(tree, pos+len(word), sofar+[word])

And then: