dbpedia / fact-extractor

Fact Extraction from Wikipedia Text

Merge all the contiguous chunks into a big single one

marfox opened this issue

Consider the following JSON output yielded by the chunk combination script:

  {
    "chunks": [
      "FEC",
      "Levallois",
      "USL",
      "Dunkerque",
      "FC",
      "Thépot",
      "Brest",
      "Red"
    ],
    "id": "12",
    "sentence": "Thépot giocò per il Brest, FEC Levallois, Red Star FC, e l'USL Dunkerque."
  }

It would be nice to have:

  {
    "chunks": [
      "FEC Levallois",
      "USL Dunkerque",
      "FC",
      "Thépot",
      "Brest",
      "Red"
    ],
    "id": "12",
    "sentence": "Thépot giocò per il Brest, FEC Levallois, Red Star FC, e l'USL Dunkerque."
  }

"Star" is not extracted as a chunk, so there is no way to obtain "Red Star FC" (even though merging it would make sense).

I used the start and end offsets of each chunk, taken from the twm and tp files. It is still not perfect, though, because defining "contiguous" is hard. The current definition considers chunks contiguous only if they are separated only by letters, but this misses cases such as "Coppa d'Africa":

  {
    "chunks": [
      "la Nazionale del suo paese", 
      "Coppa", 
      "Africa"
    ], 
    "id": "043", 
    "sentence": "Con la Nazionale del suo paese, ha giocato la Coppa d'Africa 2010."
  }, 
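
The offset-based merge described above could be sketched roughly as follows. This is only an illustration, not the code in the repository: `merge_contiguous` is a hypothetical name, the offsets are computed with `str.index` as a stand-in for the values read from the twm and tp files, and the gap rule (merge only across empty or whitespace-only gaps) is an assumption.

```python
def merge_contiguous(sentence, spans):
    """Merge (start, end) spans separated only by whitespace in `sentence`."""
    merged = []
    for start, end in sorted(spans):
        if merged and sentence[merged[-1][1]:start].strip() == "":
            # Gap is empty or whitespace only: extend the previous span.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [sentence[s:e] for s, e in merged]


sentence = "Con la Nazionale del suo paese, ha giocato la Coppa d'Africa 2010."
# Stand-in for the offsets that would come from the twm and tp files.
spans = [(sentence.index(c), sentence.index(c) + len(c))
         for c in ("la Nazionale del suo paese", "Coppa", "Africa")]
print(merge_contiguous(sentence, spans))
# ['la Nazionale del suo paese', 'Coppa', 'Africa']
```

Under this assumed rule "Coppa" and "Africa" stay separate, because the gap between them is " d'" rather than plain whitespace.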

(don't mind the first commit please, I screwed up with branches)

Symbols such as commas and dashes may be used to separate the elements of a list, and those elements should not be merged into a single chunk. The sample is now transformed to:

  {
    "chunks": [
      "Thépot", 
      "Brest", 
      "FEC Levallois", 
      "Red", 
      "FC", 
      "USL Dunkerque"
    ], 
    "id": "012", 
    "sentence": "Thépot giocò per il Brest, FEC Levallois, Red Star FC, e l'USL Dunkerque."
  }
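
One way to keep list items apart is to refuse to bridge any gap that contains a separator symbol. The sketch below is an assumption-laden illustration, not the project's implementation: `LIST_SEPARATORS`, `can_bridge`, and `merge_chunks` are hypothetical names, the separator set is a guess, and chunk positions are located with `str.find` instead of being read from the twm and tp files.

```python
# Assumed set of symbols that separate list items and must not be bridged.
LIST_SEPARATORS = set(",;-")


def can_bridge(gap):
    """True if `gap` holds no list separator and no letters or digits."""
    if any(ch in LIST_SEPARATORS for ch in gap):
        return False
    return not any(ch.isalnum() for ch in gap)


def merge_chunks(sentence, chunks):
    """Merge chunks whose gap in `sentence` passes `can_bridge`."""
    # Locate the first occurrence of each chunk and sort by position.
    spans = sorted((sentence.find(c), sentence.find(c) + len(c))
                   for c in chunks if c in sentence)
    merged = []
    for start, end in spans:
        if merged and can_bridge(sentence[merged[-1][1]:start]):
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [sentence[s:e] for s, e in merged]


sentence = ("Thépot giocò per il Brest, FEC Levallois, "
            "Red Star FC, e l'USL Dunkerque.")
chunks = ["FEC", "Levallois", "USL", "Dunkerque",
          "FC", "Thépot", "Brest", "Red"]
print(merge_chunks(sentence, chunks))
# ['Thépot', 'Brest', 'FEC Levallois', 'Red', 'FC', 'USL Dunkerque']
```

On this sentence the commas keep "Brest", "FEC Levallois", and "Red" apart, while "Red" and "FC" stay separate because the unextracted word "Star" sits between them.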