imri / mizo

Super-fast Spark RDD for Titan Graph Database on HBase

How can I get Titan vertices from HBase directly using Apache Spark?

ChaohsinChan opened this issue · comments

I am running Titan 1.0 with an HBase 1.0.3 backend. I want to get the Titan vertices from HBase directly using Apache Spark 1.6.1. Can you give me some advice? Thanks

Hey,

You can run the following code to retrieve the vertices. For example, let's count how many vertices you have in your graph.

import mizo.rdd.MizoBuilder;
import org.apache.spark.SparkConf;
import org.apache.spark.SparkContext;

public class MizoVerticesCounter {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf()
                .setAppName("Mizo Vertices Counter")
                .setMaster("local[1]")
                .set("spark.executor.memory", "4g")
                .set("spark.executor.cores", "1")
                .set("spark.rpc.askTimeout", "1000000")
                .set("spark.rpc.frameSize", "1000000")
                .set("spark.network.timeout", "1000000")
                .set("spark.rdd.compress", "true")
                .set("spark.core.connection.ack.wait.timeout", "6000")
                .set("spark.driver.maxResultSize", "100m")
                .set("spark.task.maxFailures", "20")
                .set("spark.shuffle.io.maxRetries", "20");

        SparkContext sc = new SparkContext(conf);

        long count = new MizoBuilder()
                .titanConfigPath("titan-graph.properties")
                .regionDirectoriesPath("hdfs://my-graph/*/e") // HDFS path to your HBase Table
                .parseInEdges(v -> false)
                .verticesRDD(sc)
                .toJavaRDD()
                .count(); // total number of vertices in your graph

        System.out.println("Vertices count is: " + count);
    }
}

Change 'hdfs://my-graph/*/e' to the HDFS path of your HBase Table.
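For reference, HBase keeps each table's data on HDFS under its root directory, with one subdirectory per region and one per column family, so the edgestore family ('e') of a Titan table typically matches a glob like the one below (the namenode, root directory, and table name are placeholders - check hbase.rootdir in your hbase-site.xml):

hdfs://<namenode>:8020/<hbase-rootdir>/data/default/<titan-table>/*/e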

Let me know if you have any further questions.

Thank you for your reply. I have two suggestions.
First, could we get the HDFS path through the HBase interface? That would be more convenient to use; usually we only know the HBase table name and its configuration.
Second, could the project be converted to Maven? Then it could also be developed inside Eclipse. For those who are not familiar with IntelliJ IDEA, it takes a long time to set up the development environment.

Thanks for your suggestions -

Regarding the Table name, I generally prefer not to rely on Hadoop config files, but rather specify paths directly.

Regarding Maven - good advice, I will switch to Maven and reupload soon.

Did you manage to run the code eventually?

I am not very familiar with IDEA, so I have not yet managed to set up a working development environment. Can you give me some advice?

You only have to open the root directory in IntelliJ, then go to MizoEdgesCounter, right-click and debug.

When I import the project into IDEA and choose to create a project from existing sources, it prompts that the project file already exists, and other errors occur when I choose to overwrite it. I do not know why. But if I choose to import a project from an existing model, only Eclipse, Gradle, and Maven are available. So I still did not succeed.

Try using File > Open and choose the project's .iml file.

Thank you for your suggestion; I am left with one last problem:
Module mizo-core: invalid item 'com.google.guava:guava:19.0' in the dependencies list
Module mizo-core: invalid item 'com.thinkaurelius.titan:titan-core:1.0.0' in the dependencies list
How do I introduce these dependencies? HBase and Spark do not have these dependency problems.

These dependencies should come from Maven. I see that the POMs are not included in the repo; I will add them in 12 hours.

OK, thanks. I find that the files titan-graph.properties and log4j.properties are also missing; you can add them as well.

I found an error:
Exception in thread "main" java.lang.IllegalArgumentException: Could not find implementation class: com.thinkaurelius.titan.diskstorage.hbase.HBaseStoreManager.

I suspect that this problem is related to the titan-graph.properties config. Can you show me your config?

storage.backend=hbase
storage.hostname=hlg-3p163-wangyongzhi,hlg-3p190-wangyongzhi,hlg-3p166-wangyongzhi
storage.hbase.table=titandb
storage.hbase.ext.zookeeper.znode.parent=/hbase-unsecure
cache.db-cache = true
cache.db-cache-clean-wait = 20
cache.db-cache-time = 180000
cache.db-cache-size = 0.5
index.search.backend=elasticsearch
index.search.hostname=127.0.0.1
index.search.elasticsearch.client-only=true

I wonder whether this configuration is right. I just copied it from the Titan configuration.

Add:
storage.hbase.compat-class = com.thinkaurelius.titan.diskstorage.hbase.HBaseCompat1_0

It does not work. Do I need other dependencies?

Let me build it myself and I will upload it as a complete Maven project. Will update you soon.

OK, thanks.

I have solved all the problems and am now at the last step, but there was an error:

Exception in thread "main" java.lang.ClassCastException: com.thinkaurelius.titan.graphdb.types.VertexLabelVertex cannot be cast to com.thinkaurelius.titan.graphdb.internal.InternalRelationType
at mizo.rdd.MizoRDD.lambda$loadRelationTypes$3(MizoRDD.java:146)
at java.lang.Iterable.forEach(Iterable.java:75)

Would you give me some advice?

public class MizoEdgesCounter {
    public static void main(String[] args) {
        System.setProperty("hadoop.home.dir", "C:\\F盘\\hadoop-2.6.0.tar\\hadoop-2.6.0\\hadoop-2.6.0");

        SparkConf conf = new SparkConf()
                .setAppName("Mizo Edges Counter")
                .setMaster("local[1]")
                .set("spark.executor.memory", "4g")
                .set("spark.executor.cores", "1")
                .set("spark.rpc.askTimeout", "1000000")
                .set("spark.rpc.frameSize", "1000000")
                .set("spark.network.timeout", "1000000")
                .set("spark.rdd.compress", "true")
                .set("spark.core.connection.ack.wait.timeout", "6000")
                .set("spark.driver.maxResultSize", "100m")
                .set("spark.task.maxFailures", "20")
                .set("spark.shuffle.io.maxRetries", "20");

        SparkContext sc = new SparkContext(conf);

        long count = new MizoBuilder()
                .logConfigPath("C:\\ideapluin\\mizo-master\\mizo-master\\target\\test\\mizo-rdd\\log4j.properties")
                .titanConfigPath("C:\\ideapluin\\mizo-master\\mizo-master\\target\\test\\mizo-rdd\\titan-graph.properties")
                .regionDirectoriesPath("hdfs://hlg-3p163-wangyongzhi:8020/apps/hbase/data/data/default/titandb6/8f68e1d6f9d35a4683e1a4c264cd669f/e")
                .parseInEdges(v -> false)
                .edgesRDD(sc)
                .toJavaRDD()
                .count();

        System.out.println("Edges count is: " + count);
    }
}

I did not modify your code. This error occurred here:
protected static HashMap<Long, InternalRelationType> loadRelationTypes(String titanConfigPath) {
    TitanGraph g = TitanFactory.open(titanConfigPath);
    StandardTitanTx tx = (StandardTitanTx) g.newTransaction();

    HashMap<Long, InternalRelationType> relations = Maps.newHashMap();

    tx.query()
            .has(BaseKey.SchemaCategory, Contain.IN, Lists.newArrayList(TitanSchemaCategory.values()))
            .vertices()
            .forEach(v -> relations.put(v.longId(), new MizoTitanRelationType((InternalRelationType) v)));

    g.close();

    return relations;
}

The problem above was solved, but there was also an error:
java.lang.IllegalArgumentException: Invalid ASCII encoding offset: 625
at com.thinkaurelius.titan.graphdb.database.serialize.attribute.StringSerializer.read(StringSerializer.java:105)
at mizo.hbase.MizoTitanHBaseRelationParser.readPropertyValue(MizoTitanHBaseRelationParser.java:179)
at mizo.iterators.MizoBaseRelationsIterator.handleProperty(MizoBaseRelationsIterator.java:87)
at mizo.iterators.MizoBaseRelationsIterator.getEdgeOrNull(MizoBaseRelationsIterator.java:46)

I use the Titan example, the Graph of the Gods; you can see it here: http://s3.thinkaurelius.com/docs/titan/1.0.0/getting-started.html

Fixed the bug - checked using the Graph of the Gods, works :)
Also updated the project to use Maven

Let me know if it works for you.

There was also an error; how can I resolve it? It seems to be a Guava version conflict.

Exception in thread "main" java.lang.NoSuchMethodError: com.google.common.base.Stopwatch.elapsedTime(Ljava/util/concurrent/TimeUnit;)J
at com.google.common.cache.LocalCache$LoadingValueReference.elapsedNanos(LocalCache.java:3600)
at com.google.common.cache.LocalCache$Segment.getAndRecordStats(LocalCache.java:2412)
at com.google.common.cache.LocalCache$Segment.loadSync(LocalCache.java:2373)
at com.google.common.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2335)
at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2250)
at com.google.common.cache.LocalCache.get(LocalCache.java:3985)
at com.google.common.cache.LocalCache$LocalManualCache.get(LocalCache.java:4788)
at com.thinkaurelius.titan.graphdb.transaction.StandardTitanTx$6$6.call(StandardTitanTx.java:1244)
at com.thinkaurelius.titan.graphdb.query.QueryUtil.processIntersectingRetrievals(QueryUtil.java:268)
at com.thinkaurelius.titan.graphdb.transaction.StandardTitanTx$6.execute(StandardTitanTx.java:1258)
at com.thinkaurelius.titan.graphdb.transaction.StandardTitanTx$6.execute(StandardTitanTx.java:1126)
at com.thinkaurelius.titan.graphdb.query.QueryProcessor$LimitAdjustingIterator.getNewIterator(QueryProcessor.java:198)
at com.thinkaurelius.titan.graphdb.query.LimitAdjustingIterator.hasNext(LimitAdjustingIterator.java:54)
at com.thinkaurelius.titan.graphdb.query.ResultSetIterator.nextInternal(ResultSetIterator.java:40)
at com.thinkaurelius.titan.graphdb.query.ResultSetIterator.<init>(ResultSetIterator.java:30)
at com.thinkaurelius.titan.graphdb.query.QueryProcessor.iterator(QueryProcessor.java:57)
at com.google.common.collect.Iterables$7.iterator(Iterables.java:613)
at java.lang.Iterable.forEach(Iterable.java:74)
at mizo.rdd.MizoRDD.loadRelationTypes(MizoRDD.java:149)
at mizo.rdd.MizoRDD.<init>(MizoRDD.java:71)
at mizo.rdd.MizoBuilder$1.<init>(MizoBuilder.java:53)
at mizo.rdd.MizoBuilder.edgesRDD(MizoBuilder.java:53)
at MizoEdgesCounter.main(MizoEdgesCounter.java:32)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at com.intellij.rt.execution.application.AppMain.main(AppMain.java:147)

This error is caused by a mismatch between the Guava version that Titan expects and the one brought in by other components.
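If you build with Maven, a common workaround is forcing a single Guava version across the whole build, for example (the version below is an assumption - align it with the Guava version Titan 1.0 itself declares):

<dependencyManagement>
    <dependencies>
        <dependency>
            <groupId>com.google.guava</groupId>
            <artifactId>guava</artifactId>
            <version>18.0</version> <!-- assumed; match Titan's own Guava version -->
        </dependency>
    </dependencies>
</dependencyManagement>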

I succeeded in running the code against HBase 1.0.3 -- try checking out the code into a new directory and running it from there, without any modifications. It should work.

When I run it without any modifications, I get this error:
Exception in thread "main" java.lang.IllegalArgumentException: Could not find implementation class: com.thinkaurelius.titan.diskstorage.hbase.HBaseStoreManager
at com.thinkaurelius.titan.util.system.ConfigurationUtil.instantiate(ConfigurationUtil.java:47)
at com.thinkaurelius.titan.diskstorage.Backend.getImplementationClass(Backend.java:473)
at com.thinkaurelius.titan.diskstorage.Backend.getStorageManager(Backend.java:407)
at com.thinkaurelius.titan.graphdb.configuration.GraphDatabaseConfiguration.<init>(GraphDatabaseConfiguration.java:1320)
at com.thinkaurelius.titan.core.TitanFactory.open(TitanFactory.java:94)
at com.thinkaurelius.titan.core.TitanFactory.open(TitanFactory.java:62)
at mizo.rdd.MizoRDD.loadRelationTypes(MizoRDD.java:141)

Pushed an update for fixing this, try now - working for me

I got the result, but there was an error when the job completed:

27490 [main] INFO org.apache.spark.scheduler.DAGScheduler - Job 0 finished: count at MizoEdgesCounter.java:34, took 2.037018 s
Edges count is: 34

27871 [DestroyJavaVM] WARN com.thinkaurelius.titan.graphdb.database.StandardTitanGraph - Unable to remove graph instance uniqueid c0a8adc387204-DE0018-PC1
com.thinkaurelius.titan.core.TitanException: Could not execute operation due to backend exception
at com.thinkaurelius.titan.diskstorage.util.BackendOperation.execute(BackendOperation.java:44)
at com.thinkaurelius.titan.diskstorage.util.BackendOperation.execute(BackendOperation.java:144)
at com.thinkaurelius.titan.diskstorage.configuration.backend.KCVSConfiguration.set(KCVSConfiguration.java:141)
at com.thinkaurelius.titan.diskstorage.configuration.backend.KCVSConfiguration.set(KCVSConfiguration.java:118)
at com.thinkaurelius.titan.diskstorage.configuration.backend.KCVSConfiguration.remove(KCVSConfiguration.java:159)
at com.thinkaurelius.titan.diskstorage.configuration.ModifiableConfiguration.remove(ModifiableConfiguration.java:42)
at com.thinkaurelius.titan.graphdb.database.StandardTitanGraph.closeInternal(StandardTitanGraph.java:191)
at com.thinkaurelius.titan.graphdb.database.StandardTitanGraph.access$600(StandardTitanGraph.java:78)
at com.thinkaurelius.titan.graphdb.database.StandardTitanGraph$ShutdownThread.start(StandardTitanGraph.java:803)
at java.lang.ApplicationShutdownHooks.runHooks(ApplicationShutdownHooks.java:102)
at java.lang.ApplicationShutdownHooks$1.run(ApplicationShutdownHooks.java:46)
at java.lang.Shutdown.runHooks(Shutdown.java:123)
at java.lang.Shutdown.sequence(Shutdown.java:167)
at java.lang.Shutdown.shutdown(Shutdown.java:234)
Caused by: com.thinkaurelius.titan.diskstorage.PermanentBackendException: Permanent exception while executing backend operation setConfiguration
at com.thinkaurelius.titan.diskstorage.util.BackendOperation.executeDirect(BackendOperation.java:69)
at com.thinkaurelius.titan.diskstorage.util.BackendOperation.execute(BackendOperation.java:42)
... 13 more
Caused by: java.lang.IllegalArgumentException: Connection is null or closed.
at org.apache.hadoop.hbase.client.HTable.<init>(HTable.java:310)
at org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.getTable(ConnectionManager.java:712)
at org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.getTable(ConnectionManager.java:694)
at org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.getTable(ConnectionManager.java:532)
at com.thinkaurelius.titan.diskstorage.hbase.HConnection1_0.getTable(HConnection1_0.java:22)
at com.thinkaurelius.titan.diskstorage.hbase.HBaseStoreManager.mutateMany(HBaseStoreManager.java:424)
at com.thinkaurelius.titan.diskstorage.hbase.HBaseKeyColumnValueStore.mutateMany(HBaseKeyColumnValueStore.java:189)
at com.thinkaurelius.titan.diskstorage.hbase.HBaseKeyColumnValueStore.mutate(HBaseKeyColumnValueStore.java:88)
at com.thinkaurelius.titan.diskstorage.locking.consistentkey.ExpectedValueCheckingStore.mutate(ExpectedValueCheckingStore.java:65)
at com.thinkaurelius.titan.diskstorage.configuration.backend.KCVSConfiguration$2.call(KCVSConfiguration.java:146)
at com.thinkaurelius.titan.diskstorage.configuration.backend.KCVSConfiguration$2.call(KCVSConfiguration.java:141)
at com.thinkaurelius.titan.diskstorage.util.BackendOperation.execute(BackendOperation.java:133)
at com.thinkaurelius.titan.diskstorage.util.BackendOperation$1.call(BackendOperation.java:147)
at com.thinkaurelius.titan.diskstorage.util.BackendOperation.executeDirect(BackendOperation.java:56)
... 14 more

I will fix it soon. Did you succeed?

Yes! Apart from the error above, I did get the results - it was not easy!

I will soon traverse all the vertex information and check whether it is correct.

Ok keep me updated :)

How can I bulk import data into Titan? Can you give me some advice? I have 100 GB of data. Thanks.

Hey,
Create a new transaction that uses batches (TitanGraph.buildTransaction().enableBatchLoading().checkExternalVertexExistence(false)), then commit() the transaction every X insertions, for example every 50k, as in the sketch below.
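Here is a minimal sketch of that pattern (the vertex label, property name, and data source are hypothetical placeholders - adapt them to your schema):

import com.thinkaurelius.titan.core.TitanFactory;
import com.thinkaurelius.titan.core.TitanGraph;
import com.thinkaurelius.titan.core.TitanTransaction;
import org.apache.tinkerpop.gremlin.structure.Vertex;

public class MizoBulkLoader {
    private static final int BATCH_SIZE = 50_000;

    public static void main(String[] args) {
        TitanGraph graph = TitanFactory.open("titan-graph.properties");

        TitanTransaction tx = graph.buildTransaction()
                .enableBatchLoading()
                .checkExternalVertexExistence(false)
                .start();

        long inserted = 0;
        for (String name : loadRecords()) {    // hypothetical data source
            Vertex v = tx.addVertex("user");   // hypothetical vertex label
            v.property("name", name);

            // commit every BATCH_SIZE insertions, then start a fresh batch transaction
            if (++inserted % BATCH_SIZE == 0) {
                tx.commit();
                tx = graph.buildTransaction()
                        .enableBatchLoading()
                        .checkExternalVertexExistence(false)
                        .start();
            }
        }

        tx.commit();
        graph.close();
    }

    private static Iterable<String> loadRecords() {
        return java.util.Arrays.asList("alice", "bob"); // stand-in data
    }
}

Setting storage.batch-loading=true in the Titan config also helps for large imports.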

Hello imri,
Thank you for the great work on mizo.
I met the same problems described in these Stack Overflow questions:
Q1: http://stackoverflow.com/questions/41121262/reading-a-large-graph-from-titan-on-hbase-into-spark?rq=1
Q2: http://stackoverflow.com/questions/35464538/how-to-process-large-titan-graph-using-spark
Until now, I cannot find a good practice for doing OLAP with Titan on Spark.
Have you tried to use SparkGraphComputer directly for OLAP? Do you have any example code?
In the TitanBlueprintsGraph.java file, the compute method is overridden like this:

@Override
public <C extends GraphComputer> C compute(Class<C> graphComputerClass) throws IllegalArgumentException {
    if (!graphComputerClass.equals(FulgoraGraphComputer.class)) {
        throw Graph.Exceptions.graphDoesNotSupportProvidedGraphComputer(graphComputerClass);
    } else {
        return (C) compute();
    }
}
So I think that when I create a TitanGraph, it does not support SparkGraphComputer. I can only create a HadoopGraph via graph = GraphFactory.open('conf/hadoop/hadoop-gryo.properties'); how can that traverse the Titan graph DB? I cannot find where it scans the HBase tables.
Do you have any example code for SparkGraphComputer working with Titan?

Thank you very much.

Hey,

This answer might be helpful.

I have used SparkGraphComputer with Titan, but it malfunctions and is really buggy. In order for this to work, you have to use HadoopGraph (as specified in the answer above), which internally uses an InputFormat to read the graph. Titan's implementation of InputFormat was buggy - first of all, it skips vertices (if you count the number of vertices using the InputFormat, you get a wrong answer). Second, it crashes in some circumstances (for example, an edge that connects a vertex to itself). Third, SparkGraphComputer is really, really slow - I haven't researched why. To sum up - as far as I'm concerned - SparkGraphComputer is bad.
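Just so it's clear what that route looks like, here is a rough sketch of the HadoopGraph approach (the properties file is assumed to configure gremlin.graph as HadoopGraph plus Titan's HBaseInputFormat, and the withComputer call is TinkerPop 3.2-style - older TinkerPop versions expose this differently):

import org.apache.tinkerpop.gremlin.spark.process.computer.SparkGraphComputer;
import org.apache.tinkerpop.gremlin.structure.Graph;
import org.apache.tinkerpop.gremlin.structure.util.GraphFactory;

public class HadoopGraphCounter {
    public static void main(String[] args) throws Exception {
        // assumed properties file pointing Titan's HBaseInputFormat at your table
        Graph graph = GraphFactory.open("conf/hadoop/read-hbase.properties");

        // run the count as an OLAP job on Spark
        Long count = graph.traversal()
                .withComputer(SparkGraphComputer.class)
                .V().count().next();

        System.out.println("Vertices: " + count);
    }
}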

What are you trying to achieve? Tell me more, maybe we can figure it out using Mizo.

Best regards

Thank you very much! I'm so excited that you answered me. (Please ignore my English grammatical errors.)
Now I am trying to use Titan to store some relational data about users: user follow relations and users' second-hand goods for sale. Then I want to do some OLAP analysis for relation recommendations, goods recommendations, user cluster division, and so on.
For example:
Case 1: A follows B, B follows C, and maybe A will be interested in C.
Case 2: I want to find out why and how users follow one another, and whether there are any common features.

Now I have already built my Titan cluster using HBase + Elasticsearch as the backend for the OLTP service, and I am trying to build my OLAP environment based on Titan and Spark, but I found there is no good documentation. And Titan does not even support Spark well.

When I found the mizo project, I thought maybe I could do OLAP on Spark GraphX. I mean, I would just scan my Titan HBase table for all vertices and edges into Spark, and use Spark GraphX to do the analysis. Is this possible?

Thank you again!

So if I understand you correctly, you want to expand from a given vertex through multiple hops. Mizo only allows you to expand from a given vertex to its direct edges.

I haven't used GraphX, but as far as I'm concerned, it should be really easy to integrate Mizo with it, since GraphX only expects an RDD of edges - you can convert Mizo's edges RDD to an RDD of GraphX edges, as sketched below. I'm not sure what you'll be able to achieve using GraphX, but give it a try.
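For instance, the conversion could look roughly like this (a sketch only - the edge accessors outVertexId/inVertexId/label are hypothetical stand-ins for whatever Mizo's edge class actually exposes):

import mizo.rdd.MizoBuilder;
import org.apache.spark.SparkConf;
import org.apache.spark.SparkContext;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.graphx.Edge;
import org.apache.spark.graphx.Graph;
import org.apache.spark.storage.StorageLevel;
import scala.reflect.ClassTag;
import scala.reflect.ClassTag$;

public class MizoToGraphX {
    public static void main(String[] args) {
        SparkContext sc = new SparkContext(
                new SparkConf().setAppName("Mizo to GraphX").setMaster("local[1]"));

        // map each Mizo edge to a GraphX Edge (accessors below are hypothetical)
        JavaRDD<Edge<String>> edges = new MizoBuilder()
                .titanConfigPath("titan-graph.properties")
                .regionDirectoriesPath("hdfs://my-graph/*/e")
                .parseInEdges(v -> false)
                .edgesRDD(sc)
                .toJavaRDD()
                .map(e -> new Edge<>(e.outVertexId(), e.inVertexId(), e.label()));

        // GraphX builds the vertex set from the edge endpoints
        ClassTag<String> tag = ClassTag$.MODULE$.apply(String.class);
        Graph<String, String> graph = Graph.fromEdges(edges.rdd(), "",
                StorageLevel.MEMORY_AND_DISK(), StorageLevel.MEMORY_AND_DISK(), tag, tag);

        System.out.println("GraphX edge count: " + graph.edges().count());
    }
}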

If you need any help, let me know.

Thank you, I will give it a try.

Hello imri,
I have started a Spark OLAP task based on Titan & HBase & Gremlin SparkGraphComputer. But as in your experiments, it works very slowly: with 150 vertices in the graph it takes 4 minutes, and with 10 million vertices it takes far too long.
It seems to get stuck reading the RDD from Titan.

My HBase version is 0.94, but I found that mizo depends on the HBase 1.0.2 client, and my production HBase environment does not allow me to read HFiles directly...

I am trying to solve these problems.

PS: I have a question about using Titan: is there any way to create the property key first, commit, and then do the indexing later? When I write properties without creating an index (using Elasticsearch), I get errors.

Hello,
I have successfully run the edges and vertices count test cases using Mizo! Thank you. I am using HBase 0.98, Spark 1.5.1, and Titan's Graph of the Gods.
I still have some questions. The counts do not look right: there are 17 edges, but the mizo edge count result is 32, which is not 17*2.
Then I built a very simple graph with only 3 vertices, and in my test Mizo found a vertex count of 10 - there are 7 unrelated vertices. I think these may be index vertices or internal-use vertices in Titan. This may be related to the 'Multiple Item Data Model' (ref: http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/Tools.TitanDB.BestPractices.html), because when I scan the table with the HBase shell, the same rowkey appears with multiple values.

  1. In the MizoRDD.java file, when loading relation types, why are the labels configured for vertices ignored? If I need the vertex label info, it is impossible to get:
protected static HashMap<Long, MizoTitanRelationType> loadRelationTypes(String titanConfigPath) {
    ...
            .forEach(v -> {
                if (v instanceof InternalRelationType)
                    relations.put(v.longId(), new MizoTitanRelationType((InternalRelationType) v));
            });
}
  2. When I use HBase v0.98, in MizoRegionFamilyCellsIterator.java the ASC_CELL_COMPARATOR uses CellComparator.compareRows and compareTimestamps, which do not exist in that version, so I changed them to compareStatic, as in the following code.
    private Comparator<Cell> ASC_CELL_COMPARATOR = (left, right) -> {
        int c = CellComparator.compareStatic(left, right);
        if (c != 0) {
            return c;
        } else {
            if (left.getFamilyLength() + left.getQualifierLength() == 0 &&
                    left.getTypeByte() == KeyValue.Type.Minimum.getCode()) {
                return 1;
            } else if (right.getFamilyLength() + right.getQualifierLength() == 0 &&
                    right.getTypeByte() == KeyValue.Type.Minimum.getCode()) {
                return -1;
            } else {
                boolean sameFamilySize = left.getFamilyLength() == right.getFamilyLength();
                if (!sameFamilySize) {
                    return Bytes.compareTo(left.getFamilyArray(), left.getFamilyOffset(), left.getFamilyLength(),
                            right.getFamilyArray(), right.getFamilyOffset(), right.getFamilyLength());
                } else {
                    int diff = CellComparator.compareStatic(left, right);
                    if (diff != 0) {
                        return diff;
                    } else {
                        c = Longs.compare(right.getTimestamp(), left.getTimestamp());
                        if (c != 0) diff = c;
                        //diff = CellComparator.compareTimestamps(right, left); // Different from CellComparator.compare()
                        return diff != 0 ? diff : (255 & right.getTypeByte()) - (255 & left.getTypeByte());
                    }
                }
            }
        }
    };

I do not quite understand this part. Why does it need to create an ascending-sorted cells iterator, and what does a Cell mean - is it a property or an edge within one row?
Any suggested documents for understanding HTable, regions, column families, cells, etc.?
Any suggested documents for understanding the Titan data model?