facebookresearch / flores

Facebook Low Resource (FLoRes) MT Benchmark

ERROR in download-data.sh

nxphi47 opened this issue

Thank you for this project and the paper.

I have an issue running bash download-data.sh

I think the error happens at line 155 when it tries to download the file https://anoopk.in/share/iitb_en_hi_parallel/iitb_corpus_download/parallel.tgz

In a web browser, the link appears to be dead.

The line: download_data $DATA/en-hi.tgz "https://www.cse.iitb.ac.in/~anoopk/share/iitb_en_hi_parallel/iitb_corpus_download/parallel.tgz"

Downloading https://www.cse.iitb.ac.in/~anoopk/share/iitb_en_hi_parallel/iitb_corpus_download/parallel.tgz
--2019-09-27 17:18:36--  https://www.cse.iitb.ac.in/~anoopk/share/iitb_en_hi_parallel/iitb_corpus_download/parallel.tgz
Resolving www.cse.iitb.ac.in (www.cse.iitb.ac.in)... 103.21.127.134
Connecting to www.cse.iitb.ac.in (www.cse.iitb.ac.in)|103.21.127.134|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://anoopk.in/share/iitb_en_hi_parallel/iitb_corpus_download/parallel.tgz [following]
--2019-09-27 17:18:38--  https://anoopk.in/share/iitb_en_hi_parallel/iitb_corpus_download/parallel.tgz
Resolving anoopk.in (anoopk.in)... 184.168.131.241
Connecting to anoopk.in (anoopk.in)|184.168.131.241|:443... connected.
ERROR: no certificate subject alternative name matches
	requested host name ‘anoopk.in’.
To connect to anoopk.in insecurely, use `--no-check-certificate'.
https://www.cse.iitb.ac.in/~anoopk/share/iitb_en_hi_parallel/iitb_corpus_download/parallel.tgz not successfully downloaded.

Thank you,

I got the same error. I tried adding --no-check-certificate to the wget call in download_data:

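# Download a corpus archive unless it is already present
download_data() {
  CORPORA=$1
  URL=$2

  if [ -f "$CORPORA" ]; then
    echo "$CORPORA already exists, skipping download"
  else
    echo "Downloading $URL"
    # --no-check-certificate skips TLS verification (workaround for the anoopk.in certificate error)
    wget --no-check-certificate "$URL" -O "$CORPORA" || rm -f "$CORPORA"
    if [ -f "$CORPORA" ]; then
      echo "$URL successfully downloaded."
    else
      echo "$URL not successfully downloaded."
      rm -f "$CORPORA"
      # exit statuses must be in 0-255, so use 1 rather than -1
      exit 1
    fi
  fi
}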

However, I couldn't download the data correctly. I think the server, anoopk.in, has some problems.

The server seems to be back up. Please reopen if you are still observing this issue.

The server seems to have changed the directory where the data is hosted.

I found this: http://www.cfilt.iitb.ac.in/~moses/iitb_en_hi_parallel/dataset.html

It requires entering some information before the data can be downloaded.
After submitting the form, we can download it, but the URL is not the same as the one download-data.sh assumes.

If we access the URL containing ~anoopk rather than ~moses (https://www.cse.iitb.ac.in/~anoopk/share/iitb_en_hi_parallel/iitb_corpus_download/parallel.tgz), it redirects to https://anoopk.in/share/iitb_en_hi_parallel/iitb_corpus_download/parallel.tgz. However, the anoopk.in server is not stable, and parallel.tgz still can't be downloaded from there. Even with the --no-check-certificate option added, the downloaded file might not be the correct one.
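Since a redirect or an unverified connection can return something other than the expected archive (for example an HTML error page saved under the .tgz name), a quick sanity check on the downloaded file may help. The following is only a minimal sketch: is_valid_tgz is a hypothetical helper, not part of download-data.sh, and it only confirms the file is a readable gzip-compressed tar archive, not that its contents are the right corpus.

```shell
# Hypothetical helper (not in download-data.sh): succeeds only if the
# file exists, is non-empty, and can be listed as a gzip-compressed tar.
is_valid_tgz() {
  local archive="$1"
  # -s: file exists and has a size greater than zero
  [ -s "$archive" ] || return 1
  # tar -tzf lists the archive contents without extracting anything;
  # it fails on corrupt files and on files that are not gzip tarballs
  tar -tzf "$archive" > /dev/null 2>&1
}
```

It could be called right after the wget step, e.g. `is_valid_tgz "$DATA/en-hi.tgz" || echo "en-hi.tgz does not look like a valid archive"`.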

Okay, I checked the code.

download_data $DATA/en-hi.tgz "http://www.cfilt.iitb.ac.in/iitb_parallel/iitb_corpus_download/parallel.tgz"

This new URL seems to work. Thanks!
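Because the hosting location has moved more than once in this thread, one option is to try several candidate URLs in order and stop at the first that works. This is a rough sketch under assumptions: download_first_available, fetch_with_wget, and the FETCH variable are hypothetical names introduced here, not part of download-data.sh. FETCH only exists so the fetch command can be swapped out.

```shell
# Hypothetical default fetcher: $1 = URL, $2 = destination file.
fetch_with_wget() {
  wget -q "$1" -O "$2"
}

# Hypothetical helper: try each URL in order until one download succeeds.
# Usage: download_first_available DEST URL [URL...]
download_first_available() {
  local dest="$1"
  shift
  local url
  for url in "$@"; do
    echo "Trying $url"
    # FETCH (assumed hook) names the fetch command; defaults to wget
    if "${FETCH:-fetch_with_wget}" "$url" "$dest" && [ -s "$dest" ]; then
      echo "$url successfully downloaded."
      return 0
    fi
    # discard partial or empty output before the next attempt
    rm -f "$dest"
  done
  echo "No URL could be downloaded." >&2
  return 1
}
```

For example, the working cfilt URL above could be listed first and the older ~anoopk URL second: `download_first_available "$DATA/en-hi.tgz" "http://www.cfilt.iitb.ac.in/iitb_parallel/iitb_corpus_download/parallel.tgz" "https://www.cse.iitb.ac.in/~anoopk/share/iitb_en_hi_parallel/iitb_corpus_download/parallel.tgz"`.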