If I understand you correctly, this is equivalent to the shortest common supersequence problem. This problem is NP-complete, but there exists approximation algorithms. Google turns up a few papers including this one.
If I understand you correctly, this is equivalent to the shortest common supersequence problem. This problem is NP-complete, but there exists approximation algorithms. Google turns up a few papers, including this one.
The problem can be solved with a simple DP algorithm in the case of two input sequences, but this doesn't generalize to multiple sequences since each sequence essentially requires you to add a dimension to the DP table which results in the exponential blow-up.
I had not previously heard of this problem. Thanks for the link! – templatetypedef Apr 26 at 18:44 Neither had I, but it sort of felt dual to the longest common subsequence, so I just tried flipping the words and there it was :) – hammar Apr 26 at 20:38.
I'm a bioinformatician, and this sounds like it could be solved by doing a global multiple sequence alignment of all the sentences with infinite mismatch penalties (i.e. Disallow mismatches entirely) and modest gap penalties (i.e. Allow gaps, but prefer fewer gaps), and then reading off the gapless consensus sequence.
If this formulation is equivalent to your problem, then that means your problem is indeed NP-complete, since multiple sequence alignment is NP-complete, although there are many heuristic algorithms that run in reasonable time. Unfortunately, most MSA algorithms are designed to work on characters of DNA or protein sequences, not words of English. Example Here is an example of the kind of alignment that I describe, using the set of three sentences given by the OP.
I don't know if the alignment that I give is optimal, but it is one possible solution. Gaps are indicated by a series of dashes. Sentence 1: ---- -- A red cat -- -- --- ----- -- --- Sentence 2: ---- My - --- cat is on the table -- --- Sentence 3: That -- - --- --- -- -- --- table is red Consensus: That My A red cat is on the table is red One advantage of this method is that the alignment not only gives you the full sequence of words, but shows which words belong in which sentences.
1 I think this is the right approach, but it still works if you allow mismatches. OP, to turn this into a sequence of words for your panel, I think you actually need to read down each column, "writing out" one copy of every distinct word in the column. – j_random_hacker Apr 26 at 5:01 2 Actually, what I have described here in terms of multiple alignment is in fact equivalent to the shortest common supersequence problem, as described in another answer.So that answer should be the right one.
– Ryan Thompson Apr 26 at 9:20 1 The point I was trying to make is that as you now have it, this gives only the "invariant" part of the supersequence. I.e. For the OP's example, it would produce "The is on the table".
Also I don't yet see what forbidding mismatches gains you, since any column containing k > 1 distinct words can be trivially turned into k columns each having only copies of a single word and one or more gaps. Can you explain? – j_random_hacker Apr 26 at 10:20 @j_random_hacker - I have the feeling you are trying to express something very relevant to my case, but I can't fully understand your point.
Can you articulate a bit more the thought in your first comment (I got the second one). – mac Apr 26 at 16:45 Forbidding mismatches means that each "column" of the alignment will contain only one word, and any sentence that doesn't need the word in that column will simply have a gap there. I will try to edit my answer with an example.
– Ryan Thompson Apr 26 at 22:42.
I have no actual proof, but I would be very surprised if your problem was not either NP-complete or NP-hard. I can find an algorithm which is guaranteed to give the best answer. But with even a small number of sentences of reasonable length (eg 16 sentences with 8 words each) it will bog down and not run in any reasonable time.
Therefore I would advise you not to look for exact answers, but instead to look for good heuristics that give reasonable algorithms. The heuristic that I would suggest would be to take pairs of sentences, solve the en.wikipedia.org/wiki/Longest_common_sub... for them (a solution for which is in the Python standard library, look up difflib), resolve the conflicted sections randomly, and replace the pair with that one sentence. Do this over all of your sentences, and you'll have an OK solution.
Not a great one, but an OK one. If you want to get clever, you could look for incremental improvements, run this multiple times, try to be smart about how you randomly resolve those differences, etc. But the simple approach will probably give reasonably good answers without too much effort.
1 +1, but the devil is of course in the details. The problem is nontrivial, but that doesn't automatically make it intractable ;) – abeln Apr 26 at 2:18 This looks like could be the answer I am looking for! I will code this and come back when I will have actual results.
For now, thank you! – mac Apr 26 at 9:59.
After a week of so of coding (this is hobby project) I decided to answer my own question. This is not because the answers I previously got weren't good enough, but rather because I used the three of them to code the solution I wanted, and it felt wrong to give credit to only one of the responders, as I truly used input by the three of them to come up with a satisfactory solution. Let's start from the end: the heuristic I came up gives very satisfactory results (at least for the real-life case I am using it for).
I had 1440 sentences with an average of ~6 words each and using a set of 70 unique words. The program takes about 1 minute to run and provides me with a supersequence of just 76 words (10% more than the "physical" but not "logical" lower limit of the problem). The heuristic is really tailored around my real-life case, in particular around the fact that most of the sentences are constructed around 10 or so "prototypes" (see point #2 of my edit in the question) and is composed of four successive steps: Isomorphic shrinking Similarity shrinking Coarse redundancy optimisation Fine redundancy optimisation Isomorphic shrinking I defined as "isomorphic" two sentences A and B such than transforming A in B and B in A would require exactly the same steps.
Example: 'A dog is on the carpet' 'A cat is under the table' The transformation always require to change words in position 1, 3, 5 of the first string with words in position 1, 3, 5 of the second. Grouping sentences in "isomorphic families" allows to easily create optimised superstrings by simply inserting in the common root "A __ is __ the __" the list of variable elements for position 1, 3, 5. Similarity shrinking At this stage of the process the number of sentences has dramatically lowered (there is normally a mixture of about 50 supersequences from isomorphic families and orphan sentences that were not isomorphic to any other in the pool).
The program analyses this new pool and merges together the two most similar strings, then repeating the procedure on the new set composed by the result of the merge and the strings that haven't been merged yet, and so on until everything has been merged into one supersequence. Coarse redundancy optimisation At this stage we already have a valid supersequence to all the original 1440 sentences, but such supersequence is grossly suboptimal as many terms are repeated without the need for it. This optimisation step remove the bulk of the redundant terms by simply using the supersequence to formulate all the sentences, and then removing from it all the terms that haven't been used.
Fine redundancy optimisation The result of the previous optimisation is pretty good already, but sometimes is possible to trim out a word of two via this last step. The optimisation here works by finding words that are repeated more than once, and checking if it possible to make two successive repetitions to converge towards the same location in the supersequence and still formulate all the sentences. Example: given the supersequence aaa xxx bbb ccc ddd eee xxx fff the program will try to shift the two xxx words towards each other: aaa bbb xxx ccc ddd eee xxx fff aaa bbb xxx ccc ddd xxx eee fff aaa bbb ccc xxx ddd xxx eee fff if the supersequence reaches: aaa bbb ccc xxx xxx ddd eee fff and no sentence uses both occurrences of xxx at the same time, then one of the two xxx can go.
Of course this last passage could be optimised by shifting xxx of more than one position at a time (think of shell sorting vs. bubble sorting), but the general structure of the program is such, and the gain in speed so little, that I preferred this "less optimised" procedure, as this "shifting procedure" is used elsewhere too. Again, many thanks to all the original responder: your input was paramount to make me think of this solution. Your comments to this solution are also very welcomed!
FINAL NOTE: As soon as the program will be complete / the project finished (a couple of months at worst), I will release it under GPL and add the link to the code repo in the original question. I believe that marking the question as "favourite" should notify the marker of any edit.... just in case you are interested in the actual code, of course!
I cant really gove you an answer,but what I can give you is a way to a solution, that is you have to find the anglde that you relate to or peaks your interest. A good paper is one that people get drawn into because it reaches them ln some way.As for me WW11 to me, I think of the holocaust and the effect it had on the survivors, their families and those who stood by and did nothing until it was too late.