keedio / flume-ng-sql-source

Flume Source to import data from SQL Databases

Some issues with incremental data, adds garbage

HbnKing opened this issue · comments

commented

Hi,
I am using Cloudera (version 5.14.2) (flume 1.6.0+cdh5.14.2+181), flume-ng-sql-source (version 1.5.2), and Oracle Database 11g Enterprise Edition Release 11.2.0.1.0.

Here is my setting in flume-conf:

agent.channels.ch1.type = memory  
agent.sources.sql-source.channels = ch1  
agent.channels = ch1  
agent.sinks = k1
  
agent.sources = sql-source  
agent.sources.sql-source.type = org.keedio.flume.source.SQLSource  
  
agent.sources.sql-source.hibernate.connection.url = jdbc:oracle:thin:@192.168.0.19:1521:orcl
agent.sources.sql-source.hibernate.connection.user = username
agent.sources.sql-source.hibernate.connection.password = password
agent.sources.sql-source.table = TKAFKA
agent.sources.sql-source.columns.to.select = *  
  
agent.sources.sql-source.incremental.column.name = id  
agent.sources.sql-source.incremental.value = 0  
  
agent.sources.sql-source.run.query.delay=5000  
 
agent.sources.sql-source.hibernate.connection.provider_class = org.hibernate.connection.C3P0ConnectionProvider
agent.sources.sql-source.hibernate.c3p0.min_size=1
agent.sources.sql-source.hibernate.c3p0.max_size=10
 
agent.sources.sql-source.status.file.path = /var/log/apache-flume-1.6.0-bin
agent.sources.sql-source.status.file.name = sqlSource.status
  
agent.sinks.k1.channel = ch1  
agent.sinks.k1.type = hdfs  
agent.sinks.k1.hdfs.path = hdfs://172.161.5.130/flume/sql
agent.sinks.k1.hdfs.fileType = DataStream  
agent.sinks.k1.hdfs.writeFormat = Text  
agent.sinks.k1.hdfs.rollSize = 268435456  
agent.sinks.k1.hdfs.rollInterval = 0  
agent.sinks.k1.hdfs.rollCount = 0 

But when I insert new data into the table, the output in HDFS has one more column.
My table has only two columns, and ID is the primary key!
Output example as follows:
"211","53"
"212","8"
"213","3"
"214","32"
"215","23"
"196","44"
"197","88"
"218","74","21"
"219","66","22"
"220","50","23"
"196","44","24"
"197","88","25"
"223","49","26"
"224","48","27"
"225","18","28"
"196","44","29"
"197","88","30"

What is more, some of the data is also repeated.
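As a quick sanity check on the Oracle side (only a sketch, using the table name from the config above; adjust to your schema), queries like these confirm how many columns the table actually has and whether any ID really occurs more than once in the source data:

-- List the columns Oracle sees for TKAFKA (should be exactly two)
SELECT column_name FROM user_tab_columns WHERE table_name = 'TKAFKA';

-- Look for IDs that exist more than once in the source table itself
SELECT id, COUNT(*) FROM tkafka GROUP BY id HAVING COUNT(*) > 1;

If both checks come back clean, the extra field and the repeated rows are being introduced somewhere between the query and the HDFS sink rather than in the table.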

Here are some messages for you ... it may be caused in HibernateHelper executeQuery().
The data has already been changed at query time.

2018-05-21 16:48:48,545 (PollableSourceRunner-SQLSource-sql-source) [INFO - org.keedio.flume.source.HibernateHelper.executeQuery(HibernateHelper.java:142)] qury result
2018-05-21 16:48:48,545 (PollableSourceRunner-SQLSource-sql-source) [INFO - org.keedio.flume.source.HibernateHelper.executeQuery(HibernateHelper.java:143)] [[223, 49, 26], [224, 48, 27], [225
commented

I have also met @akrishnankogentix's problem: when Flume restarts with this setting, it pulls data from the beginning only.

Hi HbnKing,

  • Flume-ng-sql 1.5.2 is not stable yet; could you try with the latest stable release, 1.5.1? Also, a full stack trace would help a lot, from when the agent starts until the exception or error starts repeating.

  • I have been trying to reproduce your problem on a MySQL server, but I have not been successful. My environment:

Cloudera, flume-core, and custom source flume version:

[root@quickstart flume-sql]# /usr/bin/flume-ng version
Flume 1.6.0-cdh5.12.0

[root@quickstart flume-sql]# ls -l  /usr/lib/flume-ng/plugins.d/flume-sql-source/lib/
flume-ng-sql-source-1.5.1.jar

My config file for agent flume-sql:

##flume-sql

agent.sinks = shdfs1
agent.channels = ch1 
agent.sources = sql1

# For each one of the sources, the type is defined
agent.sources.sql1.type = org.keedio.flume.source.SQLSource

agent.sources.sql1.hibernate.connection.url = jdbc:mysql://127.0.0.1:3306/testdb

# Hibernate Database connection properties
agent.sources.sql1.hibernate.connection.user = root
agent.sources.sql1.hibernate.connection.password = root
agent.sources.sql1.hibernate.connection.autocommit = true
agent.sources.sql1.hibernate.dialect = org.hibernate.dialect.MySQL5Dialect
agent.sources.sql1.hibernate.connection.driver_class =  com.mysql.jdbc.Driver

agent.sources.sql1.table = customers

# Columns to import (default * imports the entire row)
agent.sources.sql1.columns.to.select = *

# Query delay: the query will be sent every configured number of milliseconds
agent.sources.sql1.run.query.delay=10000

# Status file is used to save the last read row
agent.sources.sql1.status.file.path = /var/log/flume-sql
agent.sources.sql1.status.file.name = sql1.status

agent.sources.sql1.column.name = customer_id  
agent.sources.sql1.incremental.value = 0  
  

agent.sources.sql1.batch.size = 1000
agent.sources.sql1.max.rows = 1000000
agent.sources.sql1.delimiter.entry = ;
agent.sources.sql1.enclose.by.quotes = false

agent.sources.sql1.hibernate.connection.provider_class = org.hibernate.connection.C3P0ConnectionProvider
agent.sources.sql1.hibernate.c3p0.min_size=1
agent.sources.sql1.hibernate.c3p0.max_size=10


agent.sinks.k1.type = file_roll
agent.sinks.k1.sink.directory = /var/log/flume-sql
agent.sinks.k1.sink.rollInterval = 7200
agent.sinks.k1.channel = ch1

agent.channels.ch1.type = memory
agent.channels.ch1.capacity = 10000
agent.channels.ch1.transactionCapacity = 1000


agent.sources.sql1.channels = ch1

agent.sinks.shdfs1.channel = ch1  
agent.sinks.shdfs1.type = hdfs  
agent.sinks.shdfs1.hdfs.path = hdfs://127.0.0.1/user/cloudera/flume
agent.sinks.shdfs1.hdfs.fileType = DataStream  
agent.sinks.shdfs1.hdfs.writeFormat = Text  
agent.sinks.shdfs1.hdfs.rollSize = 268435456  
agent.sinks.shdfs1.hdfs.rollInterval = 0  
agent.sinks.shdfs1.hdfs.rollCount = 0 

I have a table with 1000 rows:

mysql> SELECT * FROM testdb.customers;
+-------------+------------------------------------------+-----------+
| customer_id | first_name                               | last_name |
+-------------+------------------------------------------+-----------+
|           1 | event_1_GenericCSV061418_16_0728_xmRhP   | FXDO      |
|           2 | event_2_GenericCSV061418_16_0728_xmRhP   | h5hJ      |
|           3 | event_26_GenericCSV061418_16_0728_xmRhP  | e47R      |
|           4 | event_3_GenericCSV061418_16_0728_xmRhP   | aIvL      |
|           5 | event_27_GenericCSV061418_16_0728_xmRhP  | H7yi      |
|           6 | event_4_GenericCSV061418_16_0728_xmRhP   | guco      |
|           7 | event_28_GenericCSV061418_16_0728_xmRhP  | ECK6      |
|           8 | event_5_GenericCSV061418_16_0728_xmRhP   | Jt03      |
|           9 | event_29_GenericCSV061418_16_0728_xmRhP  | wIGP      |
|          10 | event_6_GenericCSV061418_16_0728_xmRhP   | 3hlB      |
.............

My FlumeData in hdfs:

-rw-r--r--   1 flume cloudera       4784 2018-06-14 17:03 /user/cloudera/flume/FlumeData.1528988593838.tmp

[root@quickstart ~]# hadoop fs -cat  /user/cloudera/flume/FlumeData.1528988593838.tmp | more

1;event_1_GenericCSV061418_16_0728_xmRhP;FXDO
2;event_2_GenericCSV061418_16_0728_xmRhP;h5hJ
3;event_26_GenericCSV061418_16_0728_xmRhP;e47R
4;event_3_GenericCSV061418_16_0728_xmRhP;aIvL
5;event_27_GenericCSV061418_16_0728_xmRhP;H7yi
6;event_4_GenericCSV061418_16_0728_xmRhP;guco
7;event_28_GenericCSV061418_16_0728_xmRhP;ECK6
8;event_5_GenericCSV061418_16_0728_xmRhP;Jt03
9;event_29_GenericCSV061418_16_0728_xmRhP;wIGP
10;event_6_GenericCSV061418_16_0728_xmRhP;3hlB
.........

My log file from the flume agent:

2018-06-14 17:03:12,061 INFO org.apache.flume.node.PollingPropertiesFileConfigurationProvider: Configuration provider starting
2018-06-14 17:03:12,075 INFO org.apache.flume.node.PollingPropertiesFileConfigurationProvider: Reloading configuration file:/var/run/cloudera-scm-agent/process/1235-flume-AGENT/flume.conf
2018-06-14 17:03:12,080 INFO org.apache.flume.conf.FlumeConfiguration: Processing:shdfs1
2018-06-14 17:03:12,081 INFO org.apache.flume.conf.FlumeConfiguration: Added sinks: shdfs1 Agent: agent
2018-06-14 17:03:12,081 INFO org.apache.flume.conf.FlumeConfiguration: Processing:k1
2018-06-14 17:03:12,081 INFO org.apache.flume.conf.FlumeConfiguration: Processing:k1
2018-06-14 17:03:12,081 INFO org.apache.flume.conf.FlumeConfiguration: Processing:k1
2018-06-14 17:03:12,081 INFO org.apache.flume.conf.FlumeConfiguration: Processing:shdfs1
2018-06-14 17:03:12,081 INFO org.apache.flume.conf.FlumeConfiguration: Processing:k1
2018-06-14 17:03:12,081 INFO org.apache.flume.conf.FlumeConfiguration: Processing:shdfs1
2018-06-14 17:03:12,081 INFO org.apache.flume.conf.FlumeConfiguration: Processing:shdfs1
2018-06-14 17:03:12,083 INFO org.apache.flume.conf.FlumeConfiguration: Processing:shdfs1
2018-06-14 17:03:12,084 INFO org.apache.flume.conf.FlumeConfiguration: Processing:shdfs1
2018-06-14 17:03:12,084 INFO org.apache.flume.conf.FlumeConfiguration: Processing:shdfs1
2018-06-14 17:03:12,084 INFO org.apache.flume.conf.FlumeConfiguration: Processing:shdfs1
2018-06-14 17:03:12,094 INFO org.apache.flume.conf.FlumeConfiguration: Post-validation flume configuration contains configuration for agents: [agent]
2018-06-14 17:03:12,095 INFO org.apache.flume.node.AbstractConfigurationProvider: Creating channels
2018-06-14 17:03:12,100 INFO org.apache.flume.channel.DefaultChannelFactory: Creating instance of channel ch1 type memory
2018-06-14 17:03:12,105 INFO org.apache.flume.node.AbstractConfigurationProvider: Created channel ch1
2018-06-14 17:03:12,107 INFO org.apache.flume.source.DefaultSourceFactory: Creating instance of source sql1, type org.keedio.flume.source.SQLSource
2018-06-14 17:03:12,109 INFO org.keedio.flume.source.SQLSource: Reading and processing configuration values for source sql1
2018-06-14 17:03:12,253 INFO org.hibernate.annotations.common.Version: HCANN000001: Hibernate Commons Annotations {4.0.5.Final}
2018-06-14 17:03:12,263 INFO org.hibernate.Version: HHH000412: Hibernate Core {4.3.10.Final}
2018-06-14 17:03:12,266 INFO org.hibernate.cfg.Environment: HHH000206: hibernate.properties not found
2018-06-14 17:03:12,268 INFO org.hibernate.cfg.Environment: HHH000021: Bytecode provider name : javassist
2018-06-14 17:03:12,295 INFO org.keedio.flume.source.HibernateHelper: Opening hibernate session
2018-06-14 17:03:12,418 INFO org.hibernate.engine.jdbc.connections.internal.ConnectionProviderInitiator: HHH000130: Instantiating explicit connection provider: org.hibernate.connection.C3P0ConnectionProvider
2018-06-14 17:03:12,434 INFO org.hibernate.c3p0.internal.C3P0ConnectionProvider: HHH010002: C3P0 using driver: com.mysql.jdbc.Driver at URL: jdbc:mysql://127.0.0.1:3306/testdb
2018-06-14 17:03:12,434 INFO org.hibernate.c3p0.internal.C3P0ConnectionProvider: HHH000046: Connection properties: {user=root, password=****, autocommit=true}
2018-06-14 17:03:12,435 INFO org.hibernate.c3p0.internal.C3P0ConnectionProvider: HHH000006: Autocommit mode: true
2018-06-14 17:03:12,467 INFO com.mchange.v2.log.MLog: MLog clients using log4j logging.
2018-06-14 17:03:12,705 INFO com.mchange.v2.c3p0.C3P0Registry: Initializing c3p0-0.9.2.1 [built 20-March-2013 10:47:27 +0000; debug? true; trace: 10]
2018-06-14 17:03:12,882 INFO com.mchange.v2.c3p0.impl.AbstractPoolBackedDataSource: Initializing c3p0 pool... com.mchange.v2.c3p0.PoolBackedDataSource@af852b02 [ connectionPoolDataSource -> com.mchange.v2.c3p0.WrapperConnectionPoolDataSource@88568b28 [ acquireIncrement -> 3, acquireRetryAttempts -> 30, acquireRetryDelay -> 1000, autoCommitOnClose -> false, automaticTestTable -> null, breakAfterAcquireFailure -> false, checkoutTimeout -> 0, connectionCustomizerClassName -> null, connectionTesterClassName -> com.mchange.v2.c3p0.impl.DefaultConnectionTester, debugUnreturnedConnectionStackTraces -> false, factoryClassLocation -> null, forceIgnoreUnresolvedTransactions -> false, identityToken -> 1hgec119v1ypcfct1ao3nvl|872bd98, idleConnectionTestPeriod -> 0, initialPoolSize -> 1, maxAdministrativeTaskTime -> 0, maxConnectionAge -> 0, maxIdleTime -> 0, maxIdleTimeExcessConnections -> 0, maxPoolSize -> 10, maxStatements -> 0, maxStatementsPerConnection -> 0, minPoolSize -> 1, nestedDataSource -> com.mchange.v2.c3p0.DriverManagerDataSource@dd6fcfe1 [ description -> null, driverClass -> null, factoryClassLocation -> null, identityToken -> 1hgec119v1ypcfct1ao3nvl|5b6e568, jdbcUrl -> jdbc:mysql://127.0.0.1:3306/testdb, properties -> {user=******, password=******, autocommit=true} ], preferredTestQuery -> null, propertyCycle -> 0, statementCacheNumDeferredCloseThreads -> 0, testConnectionOnCheckin -> false, testConnectionOnCheckout -> false, unreturnedConnectionTimeout -> 0, usesTraditionalReflectiveProxies -> false; userOverrides: {} ], dataSourceName -> null, factoryClassLocation -> null, identityToken -> 1hgec119v1ypcfct1ao3nvl|5fe1ba04, numHelperThreads -> 3 ]
2018-06-14 17:03:13,213 INFO org.hibernate.dialect.Dialect: HHH000400: Using dialect: org.hibernate.dialect.MySQL5Dialect
2018-06-14 17:03:13,224 INFO org.hibernate.engine.jdbc.internal.LobCreatorBuilder: HHH000424: Disabling contextual LOB creation as createClob() method threw error : java.lang.reflect.InvocationTargetException
2018-06-14 17:03:13,275 INFO org.hibernate.engine.transaction.internal.TransactionFactoryInitiator: HHH000399: Using default transaction strategy (direct JDBC transactions)
2018-06-14 17:03:13,280 INFO org.hibernate.hql.internal.ast.ASTQueryTranslatorFactory: HHH000397: Using ASTQueryTranslatorFactory
2018-06-14 17:03:13,539 INFO org.apache.flume.sink.DefaultSinkFactory: Creating instance of sink: shdfs1, type: hdfs
2018-06-14 17:03:13,547 INFO org.apache.flume.node.AbstractConfigurationProvider: Channel ch1 connected to [sql1, shdfs1]
2018-06-14 17:03:13,554 INFO org.apache.flume.node.Application: Starting new configuration:{ sourceRunners:{sql1=PollableSourceRunner: { source:org.keedio.flume.source.SQLSource{name:sql1,state:IDLE} counterGroup:{ name:null counters:{} } }} sinkRunners:{shdfs1=SinkRunner: { policy:org.apache.flume.sink.DefaultSinkProcessor@185e5523 counterGroup:{ name:null counters:{} } }} channels:{ch1=org.apache.flume.channel.MemoryChannel{name: ch1}} }
2018-06-14 17:03:13,554 INFO org.apache.flume.node.Application: Starting Channel ch1
2018-06-14 17:03:13,557 INFO org.apache.flume.instrumentation.MonitoredCounterGroup: Monitored counter group for type: CHANNEL, name: ch1: Successfully registered new MBean.
2018-06-14 17:03:13,557 INFO org.apache.flume.instrumentation.MonitoredCounterGroup: Component type: CHANNEL, name: ch1 started
2018-06-14 17:03:13,557 INFO org.apache.flume.node.Application: Starting Sink shdfs1
2018-06-14 17:03:13,559 INFO org.apache.flume.instrumentation.MonitoredCounterGroup: Monitored counter group for type: SINK, name: shdfs1: Successfully registered new MBean.
2018-06-14 17:03:13,559 INFO org.apache.flume.instrumentation.MonitoredCounterGroup: Component type: SINK, name: shdfs1 started
2018-06-14 17:03:13,562 INFO org.apache.flume.node.Application: Starting Source sql1
2018-06-14 17:03:13,563 INFO org.keedio.flume.source.SQLSource: Starting sql source sql1 ...
2018-06-14 17:03:13,563 INFO org.apache.flume.instrumentation.MonitoredCounterGroup: Monitored counter group for type: SOURCE, name: SOURCESQL.sql1: Successfully registered new MBean.
2018-06-14 17:03:13,563 INFO org.apache.flume.instrumentation.MonitoredCounterGroup: Component type: SOURCE, name: SOURCESQL.sql1 started
2018-06-14 17:03:13,595 INFO org.mortbay.log: Logging to org.slf4j.impl.Log4jLoggerAdapter(org.mortbay.log) via org.mortbay.log.Slf4jLog
2018-06-14 17:03:13,645 INFO org.mortbay.log: jetty-6.1.26.cloudera.4
2018-06-14 17:03:13,801 INFO org.mortbay.log: Started SelectChannelConnector@0.0.0.0:41415
2018-06-14 17:03:13,837 INFO org.apache.flume.sink.hdfs.HDFSDataStream: Serializer = TEXT, UseRawLocalFileSystem = false
2018-06-14 17:03:14,027 INFO org.apache.flume.sink.hdfs.BucketWriter: Creating hdfs://127.0.0.1/user/cloudera/flume/FlumeData.1528988593838.tmp

best, Luis

Yes, it works if your incremental column is of integer type. The incremental column should also be in sequence, like 1, 2, 3, 4, 5.
But what if I specify an incremental column of timestamp type?
If the incremental column is of timestamp type, then it will not work.
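One possible workaround (only a sketch, not something the plugin documents; EVENT_TS is a hypothetical timestamp column) is to expose a derived integer that increases with the timestamp and let the source use that as its incremental column, for example via a view on the database side:

-- Hypothetical view: derive a sortable integer from a timestamp column EVENT_TS,
-- so the source still sees an integer-like, ever-increasing incremental column.
CREATE OR REPLACE VIEW tkafka_inc AS
SELECT TO_NUMBER(TO_CHAR(t.event_ts, 'YYYYMMDDHH24MISS')) AS inc_id,
       t.*
FROM tkafka t;

The source's table property could then point at TKAFKA_INC and the incremental column at INC_ID. Note that this only gives second-level granularity, so rows inserted within the same second could still be re-read or skipped.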