Merge all the contiguous chunks into a big single one

Question

marfox opened this issue 9 years ago · comments

Consider the following JSON output yielded by the chunk combination script:

  {
    "chunks": [
      "FEC",
      "Levallois",
      "USL",
      "Dunkerque",
      "FC",
      "Thépot",
      "Brest",
      "Red"
    ],
    "id": "12",
    "sentence": "Thépot giocò per il Brest, FEC Levallois, Red Star FC, e l'USL Dunkerque."
  }

It would be nice to have:

  {
    "chunks": [
      "FEC Levallois",
      "USL Dunkerque",
      "FC",
      "Thépot",
      "Brest",
      "Red"
    ],
    "id": "12",
    "sentence": "Thépot giocò per il Brest, FEC Levallois, Red Star FC, e l'USL Dunkerque."
  }

"Star" is not extracted, so there is no way to get "Red Star FC" (even though it would make sense).

Emilio Dorigatti · Answer 1 · Fri May 08 2015 07:27:35 GMT+0800 (China Standard Time)

I used the start and end positions of each chunk, obtained from the twm and tp files. It is still not perfect, though, since defining "contiguous" can be hard. The current definition treats two chunks as contiguous only when nothing but letters separates them, which misses cases such as "Coppa d'Africa":

  {
    "chunks": [
      "la Nazionale del suo paese", 
      "Coppa", 
      "Africa"
    ], 
    "id": "043", 
    "sentence": "Con la Nazionale del suo paese, ha giocato la Coppa d'Africa 2010."
  },

(don't mind the first commit please, I screwed up with branches)
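
For readers following along, here is a minimal sketch of offset-based merging. It is not the script from the commit: the twm and tp file formats are not shown in this thread, so the spans are located with `str.index` instead, and the gap rule is assumed to be "whitespace only" as a stand-in for the script's actual rule. The names `find_spans`, `merge_contiguous` and `WHITESPACE_GAP` are hypothetical.

```python
import re

# Assumed gap rule: two chunks are contiguous when only whitespace separates them.
WHITESPACE_GAP = re.compile(r"\s+")


def find_spans(sentence, chunks):
    """Locate each chunk's first occurrence as an end-exclusive (start, end) span."""
    return [(sentence.index(c), sentence.index(c) + len(c)) for c in chunks]


def merge_contiguous(sentence, spans):
    """Merge spans whose gap in `sentence` consists of whitespace only."""
    merged = []
    for start, end in sorted(spans):
        if merged and WHITESPACE_GAP.fullmatch(sentence[merged[-1][1]:start]):
            merged[-1] = (merged[-1][0], end)  # extend the previous span
        else:
            merged.append((start, end))
    return [sentence[s:e] for s, e in merged]


sentence = "Con la Nazionale del suo paese, ha giocato la Coppa d'Africa 2010."
chunks = ["la Nazionale del suo paese", "Coppa", "Africa"]
print(merge_contiguous(sentence, find_spans(sentence, chunks)))
# ['la Nazionale del suo paese', 'Coppa', 'Africa']
```

Under this assumed rule "Coppa" and "Africa" stay separate because the gap " d'" contains a letter and an apostrophe, which is the kind of miss described above.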

Emilio Dorigatti · Answer 2 · Tue May 12 2015 22:59:29 GMT+0800 (China Standard Time)

Symbols such as commas and dashes may separate the elements of a list, and those elements shouldn't be merged into a single chunk. The sample is now transformed to:

  {
    "chunks": [
      "Thépot", 
      "Brest", 
      "FEC Levallois", 
      "Red", 
      "FC", 
      "USL Dunkerque"
    ], 
    "id": "012", 
    "sentence": "Thépot giocò per il Brest, FEC Levallois, Red Star FC, e l'USL Dunkerque."
  }
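
The separator check can be illustrated with a small sketch. This is an assumption about how such a test might look, not the code that was committed: the separator set (comma, hyphen, en dash, em dash) and the helper names `splits_list_items` and `gaps_between` are hypothetical, and the chunks are expected in sentence order.

```python
# Assumed separator set: comma, ASCII hyphen, en dash and em dash.
LIST_SEPARATORS = frozenset(",-\u2013\u2014")


def splits_list_items(gap):
    """Return True if the text between two chunks contains a list separator."""
    return any(ch in LIST_SEPARATORS for ch in gap)


def gaps_between(sentence, chunks):
    """Return the text between consecutive chunks, given in sentence order."""
    spans, pos = [], 0
    for chunk in chunks:
        start = sentence.index(chunk, pos)  # search left to right
        pos = start + len(chunk)
        spans.append((start, pos))
    return [sentence[e1:s2] for (_, e1), (s2, _) in zip(spans, spans[1:])]


sentence = "Thépot giocò per il Brest, FEC Levallois, Red Star FC, e l'USL Dunkerque."
chunks = ["Thépot", "Brest", "FEC Levallois", "Red", "FC", "USL Dunkerque"]
print([splits_list_items(gap) for gap in gaps_between(sentence, chunks)])
# [False, True, True, False, True]
```

A gap flagged `True` marks a boundary that should never be merged across, so "Brest" and "FEC Levallois" remain separate list items. The `False` gaps are not automatically mergeable: " Star " still separates "Red" from "FC" because "Star" was never extracted as a chunk.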