Utilities for working with Java Beam Sources/Sinks from python
Currently this is a holding place for some example code for accessing BigTable via python in Apache Beam/Dataflow jobs.
Why do we have to build any java code to use the java bigtable source?
To reference Java transforms easily from Python, you need a simple Java interface: ideally one factory function that requires only primitive types, or possibly a primitive type plus a very simple configuration object. See Creating cross language java transforms for more info.
The other thing you need to do is use a serialization method that both Java and Python understand. By default Java often uses Serializable and Python uses pickle. This repo contains some supporting code to utilize Beam schemas and convert a Bigtable result to a Beam Row using schemas. Rows are supported in both Python and Java. Each Java Bigtable Result is transformed into a Schema that looks something like:
message CellValue {
    bytes family = 0;
    bytes qualifier = 1;
    bytes value = 2;
    Timestamp timestamp = 3;
}
message Result {
    repeated CellValue result = 0;
}
So in python you can access these results like so:
# this would be in something passed to beam.Map(lambda val: ...)
for r in val.result:
    # make json serializable
    family = r.family.decode("utf-8")
    qualifier = r.qualifier.decode("utf-8")
    # convert from integer microseconds to fractional seconds
    ts = r.timestamp.micros * 1.0 / 1000 / 1000
    cell = {
        "family": family,
        "qualifier": qualifier,
        "timestamp": ts,
        "value": r.value.decode("utf-8")
    }
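The loop above can be wrapped into a function that returns one JSON line per Result, which is the shape of the export output shown further down. This is a minimal sketch, not the code from python/bigtableexport.py: `result_to_json` is a hypothetical name, and the `Fake*` tuples are stand-ins so the function can be tried locally without a Beam Row.

```python
import json
from collections import namedtuple

# Hypothetical stand-ins for the Beam Row objects, for local testing only.
# Real rows come from the Java transform and expose the same field names.
FakeTimestamp = namedtuple("FakeTimestamp", ["micros"])
FakeCell = namedtuple("FakeCell", ["family", "qualifier", "value", "timestamp"])
FakeResult = namedtuple("FakeResult", ["result"])


def result_to_json(val):
    """Serialize one Result row to a JSON array of cell dicts."""
    cells = []
    for r in val.result:
        cells.append({
            "family": r.family.decode("utf-8"),
            "qualifier": r.qualifier.decode("utf-8"),
            # integer microseconds -> fractional seconds
            "timestamp": r.timestamp.micros / 1_000_000,
            "value": r.value.decode("utf-8"),
        })
    return json.dumps(cells)
```

In a pipeline this would be passed as `beam.Map(result_to_json)`. Note that it assumes every family, qualifier, and value is valid UTF-8; binary values would need a different encoding such as base64.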
To build the library run make. This essentially just does a mvn install and creates a classpath file that you'll need later.
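Before launching a pipeline it can help to sanity-check the generated classpath file. The sketch below assumes the file holds a single colon-separated list of jar paths (the `a.jar:b.jar:c.jar` form used in the argument help text later in this README); `split_classpath` and `missing_jars` are hypothetical helper names, not part of this repo.

```python
import os


def split_classpath(text, sep=":"):
    """Split a classpath string into a list of jar paths, dropping empty entries."""
    return [part.strip() for part in text.strip().split(sep) if part.strip()]


def missing_jars(jars):
    """Return the entries that do not exist on disk as regular files."""
    return [jar for jar in jars if not os.path.isfile(jar)]


if __name__ == "__main__":
    # Assumes `make` has already written a file named `classpath` in the cwd.
    with open("classpath") as fh:
        jars = split_classpath(fh.read())
    print(f"{len(jars)} jars on classpath, {len(missing_jars(jars))} missing")
```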
Ensure you have dependencies installed using whatever dependency manager you use. When in doubt: pip install apache-beam[gcp]
See python/bigtableexport.py for an example of using the Java transform. You can use this pipeline to test your Bigtable-to-GCS export or as a starting point. The pipeline can be invoked like so:
OUTPUT=gs://your-bucket/prefix
PROJECT=your-project
BIGTABLE_INSTANCE=your-bigtable-instance
BIGTABLE_TABLE=your-bigtable-table
python python/bigtableexport.py \
--bigtableProjectId $PROJECT \
--bigtableInstanceId $BIGTABLE_INSTANCE \
--bigtableTableId $BIGTABLE_TABLE \
--output $OUTPUT --classpath="$(cat classpath)" --runner=direct
You can view your results via gsutil:
gsutil cat ${OUTPUT}* | head
Expected results: one line of JSON per Result, where each line is an array of cells. Note that your results won't match those below because you are dumping from your own table. You will need an actual Bigtable instance with data in order to test.
[{"family": "column", "qualifier": "greeting", "timestamp": 1668719170.905, "value": "Hello World!"}, {"family": "column", "qualifier": "greeting", "timestamp": 1668719033.208, "value": "Hello World!"}]
[{"family": "column", "qualifier": "greeting", "timestamp": 1668719171.015, "value": "Hello Cloud Bigtable!"}, {"family": "column", "qualifier": "greeting", "timestamp": 1668719033.209, "value": "Hello Cloud Bigtable!"}]
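Each output line can hold several versions of the same cell, as in the sample above where `column:greeting` appears twice with different timestamps. A minimal sketch for reading a line back and keeping only the newest version per family and qualifier follows; `latest_cells` is a hypothetical helper, and it assumes the JSON layout shown in the sample output.

```python
import json


def latest_cells(line):
    """Parse one exported line and keep the newest cell per (family, qualifier)."""
    latest = {}
    for cell in json.loads(line):
        key = (cell["family"], cell["qualifier"])
        # Keep the cell with the largest timestamp for each column.
        if key not in latest or cell["timestamp"] > latest[key]["timestamp"]:
            latest[key] = cell
    return latest
```

This keeps the export itself lossless (all versions are written) and leaves version selection to the consumer.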
The python/bigtableexport.py file shows what is needed to utilize a Java external transform. It's only 3 small things:
- Add the import of JavaExternalTransform
- Define a classpath argument to your pipeline
- Add the JavaExternalTransform to your pipeline with the classpath defined above
# place this line with the rest of your imports
from apache_beam.transforms.external import JavaExternalTransform

# you'll need to pass in the classpath file that was generated above.
# Add this where you define your arguments
parser.add_argument(
    '--classpath',
    dest='classpath',
    required=True,
    help='list of jars required. eg a.jar:b.jar:c.jar'
)

# And finally use the transform in your pipeline
with beam.Pipeline(options=pipeline_options) as pipe:
    pipe | 'Read' \
        >> JavaExternalTransform(
            'ai.shane.bigtableshim.BigtableShim',
            classpath=classpath).From(
                known_args.bigtableProjectId,
                known_args.bigtableInstanceId,
                known_args.bigtableTableId)
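The snippet above refers to `known_args` without showing how it is produced. A sketch of the argument handling, assuming the script uses `argparse` with `parse_known_args` and declares the flags used in the example invocation, could look like this; the `parse_args` wrapper and the help strings are illustrative, and the actual python/bigtableexport.py may differ.

```python
import argparse


def parse_args(argv=None):
    """Split command-line flags into script arguments and Beam pipeline arguments."""
    parser = argparse.ArgumentParser()
    parser.add_argument('--bigtableProjectId', required=True, help='GCP project id')
    parser.add_argument('--bigtableInstanceId', required=True, help='Bigtable instance id')
    parser.add_argument('--bigtableTableId', required=True, help='Bigtable table id')
    parser.add_argument('--output', required=True, help='output path prefix')
    parser.add_argument(
        '--classpath',
        dest='classpath',
        required=True,
        help='list of jars required. eg a.jar:b.jar:c.jar'
    )
    # Flags the parser does not recognize (e.g. --runner) are returned
    # separately so they can be handed to Beam's PipelineOptions.
    known_args, pipeline_args = parser.parse_known_args(argv)
    return known_args, pipeline_args
```

The leftover `pipeline_args` list is what would typically be passed to `PipelineOptions(pipeline_args)` to build the `pipeline_options` used above.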
The Java transform utilizes CloudBigtableIO, which supports many options not specified here such as Scan config, filters, and key ranges.
If you get errors about the JVM not starting, it's possible you are using an older version of Beam, which generates invalid jars for long classpaths.