deeplearning4j / deeplearning4j-examples

Deeplearning4j Examples (DL4J, DL4J Spark, DataVec)

Home Page: http://deeplearning4j.konduit.ai

A bug in WebLogDataExample

yumg opened this issue · comments

When I run the example org.deeplearning4j.datapipelineexamples.transform.basic.WebLogDataExample, I get the following exception:

11:17:23.867 [Executor task launch worker for task 8] ERROR org.apache.spark.executor.Executor - Exception in task 0.0 in stage 2.0 (TID 8)
java.lang.IllegalArgumentException: Invalid format: "01/Jul/1995:00:00:01 -0400" is malformed at "Jul/1995:00:00:01 -0400"
	at org.joda.time.format.DateTimeFormatter.parseMillis(DateTimeFormatter.java:752)
	at org.datavec.api.transform.transform.time.StringToTimeTransform.map(StringToTimeTransform.java:243)
	at org.datavec.api.transform.transform.BaseColumnTransform.map(BaseColumnTransform.java:92)
	at org.datavec.spark.transform.transform.SparkTransformFunction.call(SparkTransformFunction.java:48)
	at org.datavec.spark.transform.transform.SparkTransformFunction.call(SparkTransformFunction.java:32)
	at org.apache.spark.api.java.JavaPairRDD$$anonfun$toScalaFunction$1.apply(JavaPairRDD.scala:1040)
	at scala.collection.Iterator$$anon$11.next(Iterator.scala:410)
	at scala.collection.Iterator$$anon$11.next(Iterator.scala:410)
	at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:193)
	at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:62)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
	at org.apache.spark.scheduler.Task.run(Task.scala:121)
	at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)

This happens because my default locale is Locale.CHINA, so the formatter cannot recognize the English month abbreviation Jul.

The locale needs to be specified explicitly.
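
The locale dependence can be reproduced outside of Spark and DataVec. The following is a minimal sketch that uses the JDK's java.time API as a stand-in for Joda-Time (an assumption made so that the example runs without extra dependencies; the class name LocaleParseDemo and the java.time pattern are illustrative and not taken from the example project). It tries to parse the same timestamp with month names from two different locales.

```java
import java.time.OffsetDateTime;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.util.Locale;

// Hypothetical demo class; not part of DataVec or the examples repository.
class LocaleParseDemo {
    // java.time equivalent of the web log timestamp layout (assumption).
    static final String PATTERN = "dd/MMM/yyyy:HH:mm:ss Z";
    static final String SAMPLE = "01/Jul/1995:00:00:01 -0400";

    /** Returns true if SAMPLE parses when month names come from the given locale. */
    static boolean parses(Locale locale) {
        DateTimeFormatter formatter = DateTimeFormatter.ofPattern(PATTERN, locale);
        try {
            OffsetDateTime.parse(SAMPLE, formatter);
            return true;
        } catch (DateTimeParseException e) {
            // The month text "Jul" is not a month name in this locale.
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("Locale.ENGLISH parses: " + parses(Locale.ENGLISH));
        System.out.println("Locale.CHINA parses:   " + parses(Locale.CHINA));
    }
}
```

With Locale.ENGLISH the abbreviation Jul is a known month name; with Locale.CHINA the formatter expects Chinese month text, so the same input should be rejected. A formatter built without an explicit locale picks up the JVM default, which is how the failure above depends on the machine it runs on.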

There is an API that sets the locale for the date formatter explicitly: org.datavec.api.transform.TransformProcess.Builder.stringToTimeTransform(String column, String format, DateTimeZone dateTimeZone, Locale locale)

So the bug could be fixed by modifying line 140 of WebLogDataExample, changing the original call (stringToTimeTransform(String column, String format, DateTimeZone dateTimeZone)) to that API.

That seems like the right solution. Unfortunately, I found that the overload that takes an explicit locale does not work either.

I will work on it.

The problem turned out to be in org.datavec.api.transform.transform.time.StringToTimeTransform.

The readObject method has a bug: it rebuilds the formatter without applying the locale:

    private void readObject(ObjectInputStream in) throws IOException, ClassNotFoundException {
        in.defaultReadObject();
        if(timeFormat != null)
            formatter = DateTimeFormat.forPattern(timeFormat).withZone(timeZone);
        else {
            List<DateTimeFormatter> dateFormatList = new ArrayList<>();
            formatters = new DateTimeFormatter[formats.length];
            for(int i = 0; i < formatters.length; i++) {
                dateFormatList.add(DateTimeFormat.forPattern(formats[i]).withZone(timeZone));
            }

            formatters = dateFormatList.toArray(new DateTimeFormatter[dateFormatList.size()]);
        }
    }

It should be like this:

    private void readObject(ObjectInputStream in) throws IOException, ClassNotFoundException {
        in.defaultReadObject();
        // Rebuild the formatter(s) after deserialization.
        if (timeFormat != null) {
            // Single-format case: apply the locale when one was configured.
            if (locale != null) {
                formatter = DateTimeFormat.forPattern(timeFormat).withZone(timeZone).withLocale(locale);
            } else {
                formatter = DateTimeFormat.forPattern(timeFormat).withZone(timeZone);
            }
        } else {
            // Multi-format case: same behavior as the original method.
            formatters = new DateTimeFormatter[formats.length];
            for (int i = 0; i < formats.length; i++) {
                formatters[i] = DateTimeFormat.forPattern(formats[i]).withZone(timeZone);
            }
        }
    }
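
To show why readObject matters here, the following is a minimal, self-contained sketch of the same pattern: a serializable object whose formatter is rebuilt after deserialization, which is presumably what happens when Spark ships the transform to an executor. It uses java.time instead of Joda-Time so that it runs on a plain JDK, and the class LocaleAwareParser and its members are hypothetical, not DataVec code.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.time.OffsetDateTime;
import java.time.format.DateTimeFormatter;
import java.util.Locale;

// Hypothetical stand-in for a transform that holds a non-serializable formatter.
class LocaleAwareParser implements Serializable {
    private static final long serialVersionUID = 1L;

    private final String pattern;
    private final Locale locale;                    // may be null
    private transient DateTimeFormatter formatter;  // not serialized, rebuilt on read

    LocaleAwareParser(String pattern, Locale locale) {
        this.pattern = pattern;
        this.locale = locale;
        this.formatter = buildFormatter();
    }

    // Single place that builds the formatter, so the constructor and
    // readObject cannot drift apart.
    private DateTimeFormatter buildFormatter() {
        return locale != null
                ? DateTimeFormatter.ofPattern(pattern, locale)
                : DateTimeFormatter.ofPattern(pattern);
    }

    private void readObject(ObjectInputStream in) throws IOException, ClassNotFoundException {
        in.defaultReadObject();
        formatter = buildFormatter(); // the locale is applied here as well
    }

    OffsetDateTime parse(String text) {
        return OffsetDateTime.parse(text, formatter);
    }

    /** Serializes and deserializes the parser in memory. */
    static LocaleAwareParser roundTrip(LocaleAwareParser parser)
            throws IOException, ClassNotFoundException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(parser);
        }
        try (ObjectInputStream in =
                     new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
            return (LocaleAwareParser) in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        Locale.setDefault(Locale.CHINA); // simulate the reporter's environment
        LocaleAwareParser copy =
                roundTrip(new LocaleAwareParser("dd/MMM/yyyy:HH:mm:ss Z", Locale.ENGLISH));
        System.out.println(copy.parse("01/Jul/1995:00:00:01 -0400"));
    }
}
```

Routing both the constructor and readObject through one builder method is a design choice for this sketch: it avoids the situation in the issue, where the constructor honored the locale but the deserialized copy did not.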