Need help splitting a corpus into multiple documents with XPath

Question

wilcar opened this issue 4 years ago

I have a press corpus with multiple articles from different newspapers, which I treat as authors. I want to run a text-mining analysis that compares the different authors.
I have an XML file and I am a beginner: can you help me fill in the import options?
Thank you for your help.

Here is the beginning of my XML file:

    <?xml version="1.0" encoding="UTF-8"?>
    <root encoding="UTF-8">
      <record>
        <content>
          EVENEMENT, jeudi 12 mars 1998 555 mots, p. 4&#13;
          "Le plus complexe, c'est l'information du malade". Un médecin réanimateur a mené une&#13;
          enquête sur les attentes des patients.&#13;
        </content>
        <author>Libération</author>
        <dates>jeudi 12 mars 1998</dates>
        <publication_date>1998-03-12</publication_date>
        <longueur>5129</longueur>
      </record>
    </root>

Andrew MacDonald · Answer 1 · Fri Jan 08 2021 05:03:32 GMT+0800 (China Standard Time)

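Try the following XPaths (the element in your sample is named content, so the path must match that spelling exactly):
Content (contenu): //content
Author (auteur): //author
Documents: //record
Publication date (date de publication): //publication_date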
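
To check outside the import dialog how these paths split the file, here is a minimal Python sketch using the standard library's xml.etree.ElementTree. It is only an illustration and not the import tool's own mechanism. The second record and the split_records helper are hypothetical, added so the split is visible, and the content path uses the `<content>` element name from the sample. ElementTree does not accept an absolute path such as //record on an element, so the sketch uses the relative form .//record.

```python
import xml.etree.ElementTree as ET

# Shortened version of the sample file; the second <record> is hypothetical.
SAMPLE = """<?xml version="1.0" encoding="UTF-8"?>
<root encoding="UTF-8">
  <record>
    <content>
      EVENEMENT, jeudi 12 mars 1998 555 mots, p. 4
    </content>
    <author>Libération</author>
    <dates>jeudi 12 mars 1998</dates>
    <publication_date>1998-03-12</publication_date>
    <longueur>5129</longueur>
  </record>
  <record>
    <content>Hypothetical second article.</content>
    <author>Example Daily</author>
    <dates>vendredi 13 mars 1998</dates>
    <publication_date>1998-03-13</publication_date>
    <longueur>28</longueur>
  </record>
</root>"""


def split_records(xml_bytes):
    """Return one dict per <record>, i.e. one document per article."""
    root = ET.fromstring(xml_bytes)
    docs = []
    # ".//record" plays the role of the documents XPath (//record).
    for record in root.findall(".//record"):
        docs.append({
            # Paths are evaluated relative to each record.
            "content": (record.findtext("content") or "").strip(),
            "author": (record.findtext("author") or "").strip(),
            "publication_date": (record.findtext("publication_date") or "").strip(),
        })
    return docs


docs = split_records(SAMPLE.encode("utf-8"))
for doc in docs:
    print(doc["author"], doc["publication_date"], len(doc["content"]))
```

Each dictionary corresponds to one document, so grouping by the author key gives the per-newspaper comparison described in the question.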

Wilfrid Cariou · Answer 2 · Wed Jan 13 2021 20:11:31 GMT+0800 (China Standard Time)

Thank you for helping. It works great.