frankyu8 / ushas

Ushas

Description

Ushas is a component built on top of Spark to strengthen data lineage governance. Table-level lineage, as used for Spark in traditional data governance, resolves data dependencies to some extent, but it cannot identify the relationships between individual fields accurately. This component was developed to give Spark column-level lineage tracking. The name Ushas signals that the goal is not a rough judgment but the ability to capture relationships accurately.

Pave the way for knowledge

Realization of logical plan in dataset

Ushas mainly modifies the spark-sql-catalyst and spark-sql-hive modules. Catalyst is responsible for managing the relational dependencies of Spark's data processing. For an ordinary Dataset, the following call embeds the logical plan in the object:
```
Dataset.ofRows(sparkSession, logicalPlan)
```

Implementation of logical plan in sql (what the parser does)

Since version 2.0, Spark SQL uses ANTLR4 for syntax analysis: it generates a syntax tree and then turns unresolved syntax information into resolved information through a depth-first traversal. Every Spark SQL statement is first parsed by SparkSqlParser, which supplies the lexer and parser that ANTLR4 requires and then generates a logical plan:

ushas/sql/core/src/main/scala/org/apache/spark/sql/SparkSession.scala

Lines 641 to 646 in ee36eac

    
    def sql(sqlText: String): DataFrame = {
      // The SQL text is parsed into a logical plan first; that plan is then analyzed to produce the resolved plan.
      // sessionState is chosen by matching the session type: for Hive, HIVE_SESSION_STATE_BUILDER_CLASS_NAME (the Hive session state) is used.
      // sessionState holds the sqlParser and the other rule sets.
      Dataset.ofRows(self, sessionState.sqlParser.parsePlan(sqlText))
    }

Internally, parsePlan converts the SQL text into a LogicalPlan:

ushas/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/ParseDriver.scala

Lines 69 to 76 in ee36eac

    
    override def parsePlan(sqlText: String): LogicalPlan = parse(sqlText) { parser =>
      astBuilder.visitSingleStatement(parser.singleStatement()) match {
        case plan: LogicalPlan => plan
        case _ =>
          val position = Origin(None, None)
          throw new ParseException(Option(sqlText), "Unsupported SQL statement", position, position)
      }
    }

Analyzer

ofRows triggers the Analyzer on the logical plan. The Analyzer runs its batches to convert the unresolved logical plan into a resolved logical plan (a rule rewrites the nodes it matches and skips the rest).

ushas/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/rules/RuleExecutor.scala

Lines 72 to 80 in ee36eac

    
    def execute(plan: TreeType): TreeType = {
      var curPlan = plan
      val queryExecutionMetrics = RuleExecutor.queryExecutionMeter
      batches.foreach { batch =>
        val batchStartPlan = curPlan
        var iteration = 1
        var lastPlan = curPlan
        var continue = true
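
The excerpt above stops before the loop body. As a rough illustration of what running a batch to a fixed point means, here is a minimal sketch in plain Scala. The `Batch` case class, the `maxIterations` field and the `execute` signature are simplified stand-ins invented for this example, not Spark's actual classes.

```scala
object FixedPointSketch {
  // Hypothetical stand-in for a Catalyst batch: a name, an iteration limit, and rules.
  // A limit of 1 behaves like "Once"; a larger limit behaves like a fixed-point strategy.
  final case class Batch[T](name: String, maxIterations: Int, rules: Seq[T => T])

  // Run each batch in order. Within a batch, apply all rules repeatedly until
  // the plan stops changing or the iteration limit is reached.
  def execute[T](plan: T, batches: Seq[Batch[T]]): T =
    batches.foldLeft(plan) { (batchStartPlan, batch) =>
      var curPlan = batchStartPlan
      var iteration = 1
      var continue = true
      while (continue) {
        val lastPlan = curPlan
        curPlan = batch.rules.foldLeft(curPlan)((p, rule) => rule(p))
        iteration += 1
        // Stop when nothing changed or when the limit is exhausted.
        if (curPlan == lastPlan || iteration > batch.maxIterations) continue = false
      }
      curPlan
    }

  // Toy rule: halve even numbers, leave odd numbers alone.
  val halve: Int => Int = n => if (n % 2 == 0) n / 2 else n
}
```

With the toy rule, a generous limit keeps halving until the value is odd, while a limit of 1 applies the rule a single time.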

The Analyzer's rule batches are as follows:

ushas/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala

Lines 152 to 214 in a7066a6

    
    lazy val batches: Seq[Batch] = Seq(
      Batch("Hints", fixedPoint,
        new ResolveHints.ResolveBroadcastHints(conf),
        ResolveHints.ResolveCoalesceHints,
        ResolveHints.RemoveAllHints),
      Batch("Simple Sanity Check", Once,
        LookupFunctions),
      Batch("Substitution", fixedPoint,
        CTESubstitution,
        WindowsSubstitution,
        EliminateUnions,
        new SubstituteUnresolvedOrdinals(conf)),
      Batch("Resolution", fixedPoint,
        ResolveTableValuedFunctions ::
          ResolveRelations ::
          ResolveReferences ::
          ResolveCreateNamedStruct ::
          ResolveDeserializer ::
          ResolveNewInstance ::
          ResolveUpCast ::
          ResolveGroupingAnalytics ::
          ResolvePivot ::
          ResolveOrdinalInOrderByAndGroupBy ::
          ResolveAggAliasInGroupBy ::
          ResolveMissingReferences ::
          ExtractGenerator ::
          ResolveGenerate ::
          ResolveFunctions ::
          ResolveAliases ::
          ResolveSubquery ::
          ResolveSubqueryColumnAliases ::
          ResolveWindowOrder ::
          ResolveWindowFrame ::
          ResolveNaturalAndUsingJoin ::
          ResolveOutputRelation ::
          ExtractWindowExpressions ::
          GlobalAggregates ::
          ResolveAggregateFunctions ::
          TimeWindowing ::
          ResolveInlineTables(conf) ::
          ResolveHigherOrderFunctions(catalog) ::
          ResolveLambdaVariables(conf) ::
          ResolveTimeZone(conf) ::
          ResolveRandomSeed ::
          TypeCoercion.typeCoercionRules(conf) ++
            extendedResolutionRules: _*),
      Batch("Post-Hoc Resolution", Once, postHocResolutionRules: _*),
      Batch("View", Once,
        AliasViewChild(conf)),
      Batch("Nondeterministic", Once,
        PullOutNondeterministic),
      Batch("UDF", Once,
        HandleNullInputsForUDF),
      Batch("FixNullability", Once,
        FixNullability),
      Batch("Subquery", Once,
        UpdateOuterReferences),
      Batch("Cleanup", fixedPoint,
        CleanupAliases),
      Batch("LineageTrack", fixedPoint,
        tailResolutionRules:_*
      )
    )

A batch in the Analyzer processes each nested logical plan with either a post-order or a pre-order traversal:

def apply(plan: LogicalPlan): LogicalPlan = plan resolveOperatorsUp

ushas/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/AnalysisHelper.scala

Lines 85 to 104 in a7066a6

    
    def resolveOperatorsUp(rule: PartialFunction[LogicalPlan, LogicalPlan]): LogicalPlan = {
      if (!analyzed) {
        AnalysisHelper.allowInvokingTransformsInAnalyzer {
          // _.resolveOperatorsUp(rule) is equivalent to passing an anonymous function here
          val afterRuleOnChildren = mapChildren(_.resolveOperatorsUp(rule))
          if (self fastEquals afterRuleOnChildren) {
            CurrentOrigin.withOrigin(origin) {
              // applyOrElse takes two arguments: the value to apply the rule to and a fallback function.
              // If the rule matches the value, the matched result is returned; otherwise the fallback is called.
              rule.applyOrElse(self, identity[LogicalPlan])
            }
          } else {
            CurrentOrigin.withOrigin(origin) {
              rule.applyOrElse(afterRuleOnChildren, identity[LogicalPlan])
            }
          }
        }
      } else {
        self
      }
    }
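
To make the bottom-up order concrete, the following is a minimal sketch on a toy plan tree. The `Plan` types, `resolveUp` and `demoRule` are hypothetical names made up for this illustration; only the use of `PartialFunction.applyOrElse` with `identity` mirrors the excerpt above.

```scala
object BottomUpSketch {
  // Hypothetical miniature plan tree; Catalyst's LogicalPlan is far richer.
  sealed trait Plan
  final case class UnresolvedTable(name: String) extends Plan
  final case class Table(name: String, columns: Seq[String]) extends Plan
  final case class Project(columns: Seq[String], child: Plan) extends Plan

  // Post-order traversal: rewrite the children first, then offer the
  // rebuilt node to the rule. Nodes the rule does not match pass through unchanged.
  def resolveUp(plan: Plan)(rule: PartialFunction[Plan, Plan]): Plan = {
    val afterRuleOnChildren = plan match {
      case Project(columns, child) => Project(columns, resolveUp(child)(rule))
      case leaf                    => leaf
    }
    rule.applyOrElse(afterRuleOnChildren, identity[Plan])
  }

  // Toy rule: resolve table "t", and expand "*" once the child is a resolved table.
  val demoRule: PartialFunction[Plan, Plan] = {
    case UnresolvedTable("t")            => Table("t", Seq("a", "b"))
    case Project(Seq("*"), table: Table) => Project(table.columns, table)
  }
}
```

Because the child is rewritten before its parent, the `Project` case already sees a resolved `Table` in a single pass. A top-down pass would visit `Project` while its child is still unresolved and would need a second iteration.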

What we do

Give LogicalPlan the ability to analyze column-level lineage

To let the logical plan recognize column-level lineage, we first modified the LogicalPlan abstract class. So as not to affect the other functions of LogicalPlan, the change takes the form of an additional trait: when a new batch that handles column lineage is added to the Analyzer, everything it touches is confined to that trait, and Spark's normal logical plan analysis is unaffected.

ushas/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/LogicalPlan.scala

Lines 30 to 36 in a7066a6

    
    abstract class LogicalPlan
      extends QueryPlan[LogicalPlan]
      with AnalysisHelper
      with LogicalPlanStats
      with QueryPlanConstraints
      with LineageHelper
      with Logging {

How traits work

The LineageHelper trait carries _lineageResolved (whether the plan has been resolved by a lineage rule), childrenLineageResolved (recursively checks whether all child logical plans have been resolved), and markLineageResolved (marks the current logical plan as successfully resolved).

ushas/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/lineageCollect/LineageHelper.scala

Lines 24 to 45 in a46317d

    
    trait LineageHelper extends QueryPlan[LogicalPlan] with LineageEntity{
      self: LogicalPlan =>
      private var _lineageResolved: Boolean = false

      /**
       * Returns true if all the children of this expression have been resolved to a specific schema
       * and false if any still contains any unresolved placeholders.
       */
      def childrenLineageResolved: Boolean = children.forall(_.lineageResolved)

      /**
       * Whether the downstream lineage has been resolved (i.e. the corresponding Col objects have been generated)
       */
      lazy val lineageResolved: Boolean = childrenLineageResolved

      /**
       * Marks this object as resolved
       */
      def markLineageResolved(): Unit = {
        _lineageResolved = true
      }
    }

All column-level objects are held in the lineageChildren container.
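
The flag pattern can be tried in isolation with a small self-contained sketch. `LineageFlag` and `Node` are hypothetical names for this example. Note one deliberate difference: in the excerpt above, lineageResolved is derived from the children only, whereas this sketch also consults the node's own flag, which is an assumption about how the two pieces could be combined rather than the project's actual behavior.

```scala
object LineageFlagSketch {
  // Hypothetical miniature of a plan node that carries a lineage flag.
  trait LineageFlag {
    def children: Seq[LineageFlag]

    private var _lineageResolved: Boolean = false

    // Mark this node as resolved by a lineage rule.
    def markLineageResolved(): Unit = {
      _lineageResolved = true
    }

    // True when every child reports its lineage as resolved.
    def childrenLineageResolved: Boolean = children.forall(_.lineageResolved)

    // Assumption for this sketch: a node is resolved once it has been
    // marked and all of its children are resolved as well.
    def lineageResolved: Boolean = _lineageResolved && childrenLineageResolved
  }

  final class Node(val children: Seq[LineageFlag]) extends LineageFlag
}
```

Keeping the state in a separate trait means the host class only gains a `with` clause, which is the isolation the modification to LogicalPlan aims for.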

Why column-level objects exist

A logical plan operates on operators, not on the members inside an operator. The statement 'select a, b, c from table', for example, produces only a relation plan and a Project plan. What we care about is the projectList of the Project plan, which contains all of its expressions. To avoid operating on the expressions in the projectList directly, we predefine column-level objects that correspond one-to-one with the expressions, record the attributes of each expression that we care about (the objects extend TreeNode), and put them into lineageChildren. As long as every depth-first pass of the Analyzer copies lineageChildren upward from nodes that take no part in the computation and resolves the relationships at the nodes that do, column-level fields are analyzed correctly.

The subclasses of the column-level object are ExpressionColumn, RelationColumn and UnionColumn. ExpressionColumn records the expressions in a Project logical plan, RelationColumn records which relation a LeafNode column belongs to, and UnionColumn records the special column relationship of a union, which also has to identify the corresponding information in the right-hand lineageChildren. See https://github.com/frankyu8/ushas/tree/main/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/lineage
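
As a rough picture of how such column objects nest, here is a minimal sketch of a column tree with an indented text rendering. `ColNode`, `ExprCol`, `RelCol` and `render` are hypothetical names chosen for this example and are not the classes in the lineage package linked above.

```scala
object ColumnTreeSketch {
  // Hypothetical column-level nodes: an expression column points at the
  // columns it is derived from; a relation column is a leaf.
  sealed trait ColNode {
    def label: String
    def children: Seq[ColNode]
  }
  final case class ExprCol(label: String, children: Seq[ColNode]) extends ColNode
  final case class RelCol(label: String) extends ColNode {
    val children: Seq[ColNode] = Nil
  }

  // Render the tree one node per line, indenting each level by three
  // spaces and prefixing every non-root node with "+- ".
  def render(node: ColNode, depth: Int = 0): String = {
    val prefix = if (depth == 0) "" else ("   " * (depth - 1)) + "+- "
    val line = prefix + node.label
    (line +: node.children.map(child => render(child, depth + 1))).mkString("\n")
  }
}
```

For example, `render(ExprCol("c#2", Seq(ExprCol("Alias ( a#0 + 1 AS c#2 )", Seq(RelCol("a#0"))))))` yields three lines, each one level deeper than the previous.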
How does the rule work

Because a logical plan is encapsulated in every DataFrame, each DataFrame method call forces the logical plan through the Analyzer. We therefore added our own column-level lineage rules to the Analyzer's rules:

ushas/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala

Lines 211 to 212 in a46317d

    Batch("LineageTrack", fixedPoint,
      tailResolutionRules:_*

We added two new resolution rules: one for Relation (resolving the leaf nodes of field lineage) and one for Expression (resolving all intermediate relationships of field lineage).

Resolution rule for Relation:

ushas/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveLineage.scala

Line 120 in 796fc00

class ResolveRelation extends Rule[LogicalPlan] {

The Relation rule is a simple ownership check: it records the attributes of each field in the logical plan without looking them up in the catalog.

Resolution rule for Expression:

ushas/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveLineage.scala

Line 29 in 796fc00

class ResolveExpression extends Rule[LogicalPlan] {

The Expression rule uses a plain map lookup: the exprId of a field in the upper node is matched against the exprIds of the fields in the lower node, and matching fields are bound together (relying on the uniqueness of exprId).
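
The lookup described above can be sketched in a few lines of plain Scala. `Col` and `bind` are hypothetical names for this example; the real rule works on Catalyst expressions and their ExprId values, which are reduced to `Long` here.

```scala
object ExprIdBindingSketch {
  // Hypothetical column object: an expression id, a label, and the
  // lower-level columns it was bound to.
  final case class Col(exprId: Long, label: String, parents: Seq[Col] = Nil)

  // Build an upper-level column and bind each exprId it references to the
  // lower-level column carrying the same id. Ids without a match are skipped.
  def bind(exprId: Long, label: String, references: Seq[Long], lower: Seq[Col]): Col = {
    val byId: Map[Long, Col] = lower.map(c => c.exprId -> c).toMap
    Col(exprId, label, references.flatMap(id => byId.get(id)))
  }
}
```

Indexing the lower columns by id once makes each lookup a map access. Whether the project builds such an index or scans the list on every match is not shown in this README.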

Recognition of hive relation

For Hive, the current implementation collects all of the Hive information. When enableHiveSupport is turned on for the SparkSession, the Analyzer is overridden by default, so to support Hive data sources we added tailResolutionRules to the Analyzer base class and override it in the derived Analyzer.

Analyzer of the base class

ushas/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala

Lines 149 to 150 in 30faef4

    
    val tailResolutionRules: Seq[Rule[LogicalPlan]] = Seq(new ResolveLineage.ResolveRelation,
      new ResolveLineage.ResolveExpression)

Overriding the Analyzer in Hive's session state builder

ushas/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveSessionStateBuilder.scala

Lines 69 to 94 in 30faef4

    
    override protected def analyzer: Analyzer = new Analyzer(catalog, conf) {
      override val extendedResolutionRules: Seq[Rule[LogicalPlan]] =
        new ResolveHiveSerdeTable(session) +:
          new FindDataSourceTable(session) +:
          new ResolveSQLOnFile(session) +:
          customResolutionRules

      override val postHocResolutionRules: Seq[Rule[LogicalPlan]] =
        new DetermineTableStats(session) +:
          RelationConversions(conf, catalog) +:
          PreprocessTableCreation(session) +:
          PreprocessTableInsertion(conf) +:
          DataSourceAnalysis(conf) +:
          HiveAnalysis +:
          customPostHocResolutionRules

      override val extendedCheckRules: Seq[LogicalPlan => Unit] =
        PreWriteCheck +:
          PreReadCheck +:
          customCheckRules

      override val tailResolutionRules: Seq[Rule[LogicalPlan]] =
        Seq(new ResolveHiveRelation,
          new ResolveLineage.ResolveExpression)
    }
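
The override takes effect because `batches` in the base Analyzer is a `lazy val`: it is built on first use, after the anonymous subclass has initialized its own `tailResolutionRules`. A minimal sketch of that mechanism follows, with rules reduced to strings; `BaseAnalyzer` and `hiveAnalyzer` are hypothetical names for this example.

```scala
object TailRulesSketch {
  // Hypothetical base analyzer; rule objects are replaced by their names.
  class BaseAnalyzer {
    val tailResolutionRules: Seq[String] = Seq("ResolveRelation", "ResolveExpression")

    // Lazy, so the rule list is read on first access, when any
    // overriding val in a subclass has already been initialized.
    lazy val batches: Seq[(String, Seq[String])] =
      Seq("LineageTrack" -> tailResolutionRules)
  }

  // Anonymous subclass that swaps in a Hive-specific relation rule.
  val hiveAnalyzer: BaseAnalyzer = new BaseAnalyzer {
    override val tailResolutionRules: Seq[String] =
      Seq("ResolveHiveRelation", "ResolveExpression")
  }
}
```

Had `batches` been a strict `val`, it would be evaluated during the base class constructor, before the subclass field is assigned, and would see `null` instead of the overridden list.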

Software Architecture

[module]assembly
The assembly module makes the packaged output easier to obtain. It is carried over from Spark's native code, packages everything in one step, and places all jars in the target/scala directory.
[module]dev
The dev module configures the checkstyle code-style checks. Spark ships its own Scala code-style requirements, and we follow all of them here. The report is written to target/checkstyle-output.xml.
[module]examples
The examples module provides application examples of column-level lineage.
[directory]sql
The sql directory contains all of Spark's Catalyst analysis, and the main column-level lineage work is concentrated in the three modules it contains.

Installation tutorial

In IDEA, use Maven's 'Generated Source Code' on the spark-catalyst module to generate the syntax tree files from sqlbase.g4.
When running the sample file, first set the parameter -DLocal, then enable 'Include with provided scope', and run Spark in the default local mode with the local packages.
The sample file is located at examples/src/main/scala/org/apache/spark/examples/lineage/SparkLineageExample.scala
When packaging, check hive in the Spark profiles if you need Hive plug-in support.
Because the change is isolated within the Spark project, replacing spark-hive_2.12-3.1.2.jar and spark-catalyst_2.12-3.1.2.jar is enough to deploy column-level lineage quickly.

Show results

Sample SQL: select * from (select substr(a+1,0,1) as c,a+3 as d from (select 1 as a,2 as b))
Sample output:

c#2
+- c#2
   +- Alias ( substring(cast((a#0 + 1) as string), 0, 1) AS c#2 )
      +- a#0
         +- Alias ( 1 AS a#0 )

How to view column-level lineage in spark-shell

df.queryExecution.analyzed.lineageChildren(0).treeString
How to view column-level lineage in pyspark

df._jdf.queryExecution().analyzed().lineageChildren().apply(0).treeString()

Things you can do

Optimize the existing code structure. The project's code structure currently mirrors that of Spark 2.4. Modules can be modified selectively so that the main code is gathered in one place.
Add a new module. Only Spark column-level lineage analysis is provided so far; automatic storage and display of lineage is left open for the community.
Optimize how the existing rules search. Matching column relationships currently relies on the global uniqueness of exprId and is done by traversal, which costs time; see whether it can be optimized.
Standardize the resolved marker. Spark marks resolution by replacing unresolved logical plan nodes with resolved ones, whereas we simply set the lineage-resolved flag to true on the logical plan and provide no standardized case class. Reworking this is a very large engineering effort.
Try the plug-in route. The examples include an attempt to inject the rules through Spark's plug-in mechanism. The hope was to modify the source code as little as possible and implement the rules externally, but the result was not satisfactory. You can experiment with how to configure the plug-in.
Support other data sources. Only the Hive data source is handled so far: column-level lineage can accurately identify Hive data sources in the catalog, but other external data sources are not defined at all. That part is entirely up to you.

Participate in contribution

Fork this repository
Add Feat_xxx branch
Submit code
New Pull Request
Help

How to associate issue and pr

https://docs.github.com/cn/issues/tracking-your-work-with-issues/linking-a-pull-request-to-an-issue

How to configure checkstyle code inspection

https://blog.csdn.net/qq_31424825/article/details/100050445
