GoogleCloudPlatform / dataflow-opinion-analysis

Opinion Analysis of News, Threaded Conversations, and User Generated Content

Controller not listening to Pub/Sub commands

abaumann opened this issue · comments

This is a great and well documented example - much appreciated!

I am having trouble, however, with the "Run a verification job" step. When I publish the command=start_gcs_import message to the indexercommands topic, no new input job is created. I tried out the cron jobs to see if those were working, and I can see a new job created for startstatscalc only, not for the others. I feel like I must have messed up some config step along the way.

Do you have tips for how to debug what is happening? I am not seeing any sort of logging to help me debug, but that's where I'd think to look first. It feels like the controller must not be listening for the topics correctly though...

Thanks!

Yes, sorry, the controller pipeline has not yet been updated after the recent upgrade from a pre-2.x SDK to the 2.2 SDK.

For the time being, launch the IndexerPipeline directly using the example in the Release Notes for version 0.6.4: https://github.com/GoogleCloudPlatform/dataflow-opinion-analysis/releases/tag/v0.6.4

e.g. using

mvn compile exec:java \
-Dexec.mainClass=com.google.cloud.dataflow.examples.opinionanalysis.IndexerPipeline \
-Dexec.args="--project=$PROJECT_ID
...

I will make a note in the README not to run the controller pipeline but to start the IndexerPipeline instead.

So I tried modifying run_controljob.sh to use IndexerPipeline as the mainClass, but I'm getting this error:

	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:497)
	at org.codehaus.mojo.exec.ExecJavaMojo$1.run(ExecJavaMojo.java:293)
	at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.IllegalArgumentException: Class interface com.google.cloud.dataflow.examples.opinionanalysis.IndexerPipelineOptions missing a property named 'controlPubsub'.
	at org.apache.beam.sdk.options.PipelineOptionsFactory.parseObjects(PipelineOptionsFactory.java:1579)
	at org.apache.beam.sdk.options.PipelineOptionsFactory.access$400(PipelineOptionsFactory.java:104)
	at org.apache.beam.sdk.options.PipelineOptionsFactory$Builder.as(PipelineOptionsFactory.java:291)
	at com.google.cloud.dataflow.examples.opinionanalysis.IndexerPipeline.main(IndexerPipeline.java:110)
	... 6 more

Maybe it needs some additional flags that aren't in those release notes?

If you are adding this to the README, I can check it there.
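The exception above says that a flag named controlPubsub was passed to a pipeline whose options interface does not declare it, so one way to narrow this down is to list every flag a launch script passes and compare that list with the release notes. The following is a minimal sketch; the function name list_flags and the sample argument string (including the placeholder values) are hypothetical and not part of the repository.

```shell
# List the distinct --flag names that appear in the text read from stdin,
# one per line, sorted.
list_flags() {
  grep -oE -- '--[A-Za-z][A-Za-z0-9]*' | sort -u
}

# Hypothetical sample input: a command line that still carries a flag
# the target pipeline does not declare. The values are placeholders.
sample_args='--project=EXAMPLE --controlPubsub=EXAMPLE --streaming=false'

printf '%s\n' "$sample_args" | list_flags
```

Running `list_flags < run_controljob.sh` would print the flags used by the modified script, which can then be checked one by one against the command in the release notes.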

Just insert something like this into the shell script, and make sure the $ variables are set (this statement is from the release notes):

mvn compile exec:java \
-Dexec.mainClass=com.google.cloud.dataflow.examples.opinionanalysis.IndexerPipeline \
-Dexec.args="--project=$PROJECT_ID \
--runner=DataflowRunner \
--maxNumWorkers=10 \
--workerMachineType=n1-standard-2 \
--stagingLocation=gs://$GCS_BUCKET/staging/ \
--tempLocation=gs://$GCS_BUCKET/temp/ \
--streaming=false \
--autoscalingAlgorithm=THROUGHPUT_BASED \
--bigQueryDataset=opinions \
--writeTruncate=true \
--processedUrlHistorySec=130000 \
--wrSocialCountHistoryWindowSec=610000 \
--ratioEnrichWithCNLP=0 \
--sourceRecordFile=true \
--inputFile=gs://$GCS_BUCKET/input/*.txt \
--indexAsShorttext=false
"
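Since the command above depends on $PROJECT_ID and $GCS_BUCKET being set, a small guard at the top of the shell script can catch an empty variable before Maven starts. This is only a sketch; the helper name check_vars is hypothetical and does not exist in the repository.

```shell
# Report every named variable that is unset or empty.
# Returns 1 if at least one is missing, 0 otherwise.
check_vars() {
  missing=0
  for name in "$@"; do
    # Indirect lookup of the variable whose name is stored in $name.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "missing variable: $name" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Intended use in the launch script (left commented out here):
# check_vars PROJECT_ID GCS_BUCKET || exit 1
```

With the guard in place, the script stops with a message naming the missing variable instead of submitting a job with an empty project or bucket.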

Oh actually that worked - I thought I looked closely enough to see the flags were the same there as the existing template, but I guess not, thanks!

Glad to hear, and sorry for the issue with the controller. I meant to fix it earlier.

No problem - I just wanted a baseline of something working so I can start to modify from here to handle our use case, so now that I've seen this working, I'm good. The only other issue I noticed, by the way, is with the instructions to run:

SELECT * FROM opinions.sentiment 
ORDER BY DocumentTime DESC
LIMIT 100

Since this table contains repeated fields, we get an error like this:
Cannot query the cross product of repeated fields Signals and Tags.GoodAsTopic

I really hate that select * doesn't do some default flattening in BQ. I didn't do the work to come up with a query and just used the preview pane instead.

Thanks!

Uncheck the "Use Legacy SQL" checkbox so that BigQuery uses Standard SQL for that query. You can do that by going to "Show Options".

But good point: I will add the #standardSQL tag to the query, so that it is easier to use this sample.
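For anyone running the query from a terminal rather than the web UI, the tagged query could look roughly like the sketch below. It assumes the bq command-line tool is installed and authenticated; the variable QUERY and the function run_query are hypothetical names chosen for illustration.

```shell
# The query from this thread with the #standardSQL prefix on its first line,
# which asks BigQuery to use Standard SQL for this query.
QUERY='#standardSQL
SELECT * FROM opinions.sentiment
ORDER BY DocumentTime DESC
LIMIT 100'

# Submit the query with the bq tool. Defined here but not called,
# because it needs an authenticated Cloud SDK installation.
run_query() {
  bq query "$QUERY"
}

# Show the exact text that would be submitted.
printf '%s\n' "$QUERY"
```

Calling `run_query` submits the query. Dropping the prefix and passing `--use_legacy_sql=false` to `bq query` is an alternative way to select Standard SQL.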

🎆 I did not know that unchecking Legacy SQL was a fix for this - thanks for that tip! 🎆