lyfeyaj / jaql

Automatically exported from code.google.com/p/jaql

Weird problem with keyLookup

GoogleCodeExporter opened this issue

What steps will reproduce the problem?

What is the expected output? What do you see instead?


What version of the product are you using? On what operating system?
jaql trunk r248

Please provide any additional information below.
There seems to be a weird problem with keyLookup that popped up recently: it is unable to read one of the temp files that it itself generated. This is a new problem; I did not run into it until last night. Here is how it goes.

Following is the JAQL code:
$ratings = read(hdfs('/user/sudipto/netflix/data/all/json'));
$estrate = 0;
$cust = read(hdfs('/user/sudipto/netflix/data/all/materialized/custparam'));
$movie = read(hdfs('/user/sudipto/netflix/data/all/materialized/movieparam'));

$imHashJoin = fn($outer, $okey, $inner, $ikey) (
  $build = $inner -> transform [$ikey($), $],
  $outer -> transform [$okey($), $]
         -> keyLookup($build)
         -> transform {$[1].*, $[2].*}
);

$ratings
  -> $imHashJoin(fn($r) $r.tid, $movie, fn($m) $m.movie_id)
  -> $imHashJoin(fn($r) $r.cid, $cust, fn($c) $c.cust_id)
  -> transform { $.cust_id, $.movie_id, $.rating, diff: $.rating - $estrate,
      $.cparam, $.mparam }
  -> write(hdfs('/user/sudipto/netflix/data/all/materialized/join'));
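For reference, the same keyLookup pattern boiled down to inline toy data (the records and field names below are made up purely for illustration) would look like this:

// Minimal sketch of the keyLookup pattern used by $imHashJoin above;
// the toy records and field names are invented for illustration only.
$inner = [ {id: 1, name: "a"}, {id: 2, name: "b"} ];
$outer = [ {ref: 1, x: 10}, {ref: 2, x: 20} ];
$build = $inner -> transform [$.id, $];
$outer -> transform [$.ref, $]
       -> keyLookup($build)
       -> transform {$[1].*, $[2].*};

The expected result is one record per outer element with the fields of both sides merged, which is exactly what the full script above does with $ratings, $movie, and $cust.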

In the hash join, Jaql first spawns an MR job for each inner table that reads it and writes it to a temp file. It then joins the large outer table against the temped inner tables. (This is somewhat new; I think the earlier version of keyLookup did not do this. Perhaps you wanted to fix the inlining problem of this expression?) In any case, when the third MR job, which performs the main join, is spawned, it reports the following error:

org.apache.hadoop.mapred.InvalidInputException: Input path does not exist:
hdfs://impl00.almaden.ibm.com:9000/user/sudipto/jaql_temp_4847551314303483
        at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:179)
        at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:210)
        at com.ibm.jaql.io.hadoop.DefaultHadoopInputAdapter.getSplits(DefaultHadoopInputAdapter.java:163)
        at com.ibm.jaql.io.hadoop.DefaultHadoopInputAdapter.iter(DefaultHadoopInputAdapter.java:184)
        at com.ibm.jaql.lang.expr.io.AbstractReadExpr$1.<init>(AbstractReadExpr.java:100)
        at com.ibm.jaql.lang.expr.io.AbstractReadExpr.iter(AbstractReadExpr.java:99)
        at com.ibm.jaql.lang.expr.index.KeyLookupFn.iter(KeyLookupFn.java:72)
        at com.ibm.jaql.lang.expr.core.BindingExpr.iter(BindingExpr.java:209)
        at com.ibm.jaql.lang.expr.core.TransformExpr.iter(TransformExpr.java:148)
        at com.ibm.jaql.lang.expr.core.DoExpr.iter(DoExpr.java:126)
        at com.ibm.jaql.lang.core.JaqlFunction.iter(JaqlFunction.java:269)
        at com.ibm.jaql.lang.core.JaqlFunction.iter(JaqlFunction.java:350)
        at com.ibm.jaql.lang.expr.hadoop.MapReduceBaseExpr$MapEval.run(MapReduceBaseExpr.java:358)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:227)
        at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2198)

Following is the explain:

(
  $fd_2 = mapReduce({
    ("input"):  (hdfs("/user/sudipto/netflix/data/all/materialized/custparam")),
    ("output"): (HadoopTemp()),
    ("map"):    (fn($mapIn) ( $mapIn
                   -> transform each $ ([null, [($).("cust_id"), $]]) ))
  }),
  $fd_0 = mapReduce({
    ("input"):  (hdfs("/user/sudipto/netflix/data/all/materialized/movieparam")),
    ("output"): (HadoopTemp()),
    ("map"):    (fn($mapIn) ( $mapIn
                   -> transform each $ ([null, [($).("movie_id"), $]]) ))
  }),
  write(
    (
      $fd_1 = mapReduce({
        ("input"):  (hdfs("/user/sudipto/netflix/data/all/json")),
        ("output"): (HadoopTemp()),
        ("map"):    (fn($mapIn) ( keyLookup($mapIn
                       -> transform each $ ([($).("tid"), $]), read($fd_0))
                       -> transform each $ ([null, { (index($, 1)).*, (index($, 2)).* }]) ))
      }),
      keyLookup(read($fd_1)
        -> transform each $ ([($).("cid"), $]), read($fd_2))
        -> transform each $ ({ (index($, 1)).*, (index($, 2)).* })
    )
    -> transform each $ ({ (($)).("cust_id"), (($)).("movie_id"), (($)).("rating"),
         ("diff"):((($).("rating"))-(0)), (($)).("cparam"), (($)).("mparam") }),
    hdfs("/user/sudipto/netflix/data/all/materialized/join"))
)


Note that this problem was encountered when the MapReduce cluster was running under a username different from the user account used to submit jobs from another remote machine.
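A possible interim workaround (untested, and the paths below are made up) might be to materialize the build sides to explicit HDFS locations with write() instead of relying on the implicit temps, and read them back from there in the probe step, e.g.:

// Untested workaround sketch: write the build sides to fixed HDFS paths so the
// probe job reads a stable location rather than an implicit jaql_temp_* file.
// The '/user/sudipto/tmp/...' paths are invented for illustration.
$movie -> transform [$.movie_id, $] -> write(hdfs('/user/sudipto/tmp/movie_build'));
$cust  -> transform [$.cust_id, $]  -> write(hdfs('/user/sudipto/tmp/cust_build'));

$ratings -> transform [$.tid, $]
         -> keyLookup(read(hdfs('/user/sudipto/tmp/movie_build')))
         -> transform {$[1].*, $[2].*};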

Original issue reported on code.google.com by sudipt...@gmail.com on 22 Jul 2009 at 2:01

Jasper and I had the same problem; he committed the fix for it in r263.

Original comment by moritzka...@web.de on 13 Aug 2009 at 7:52

  • Changed state: Fixed