Detailed explanation of hivesql execution plan


January 6, 2022

A Hive SQL execution plan is an outline of how a SQL statement will actually run. Through the execution plan you can understand how the SQL is translated into the execution logic of the underlying compute engine, and once you grasp that logic, you can locate the program's bottlenecks and optimize in a targeted way. It also helps developers recognize that SQL statements which look equivalent may in fact not be, and that statements which look different may in fact be equivalent. The execution plan, you could say, is the key that opens the door to SQL optimization.

To learn to read execution plans, you first need the command that displays them: explain. Prefixing a SQL query with the keyword explain is the basic way to view its execution plan.

This article was first published on the official account "Five Minutes to Learn Big Data".

Learning explain well brings great convenience to our everyday work with Hive!

Viewing a SQL execution plan

Hive provides several kinds of execution plan information that can be viewed:

  • explain: shows the basic execution plan information;

  • explain dependency: adds information about the inputs of the plan, showing the various properties of each input;

  • explain authorization: shows permission-related information for the SQL operations;

  • explain vectorization: shows vectorization information for the SQL, including why the Map and Reduce stages were not vectorized. Supported since Hive 2.3.0;

  • explain analyze: annotates the plan with actual row counts. Supported since Hive 2.2.0;

  • explain cbo: outputs the plan generated by the Calcite optimizer. Supported since Hive 4.0.0;

  • explain locks: shows which locks the system will acquire in order to run the specified query. Supported since Hive 3.2.0;

  • explain ast: outputs the abstract syntax tree of the query. It was removed in Hive 2.1.0 because of a bug (dumping the AST could cause an OOM error) and is scheduled to be restored in 4.0.0;

  • explain extended: adds extra information about the plan, usually physical details such as file names; this extra information is of little use to us.
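
As a quick orientation, each of these variants is just a keyword (or pair of keywords) prefixed to an ordinary query. An illustrative sketch (the table name test1 is an assumption here, chosen to match the examples used later in this article):

```sql
-- Each explain variant simply prefixes an ordinary query (table name assumed).
explain select sum(id) from test1;                -- basic execution plan
explain dependency select sum(id) from test1;     -- input tables and partitions
explain authorization select sum(id) from test1;  -- inputs, outputs, user, operation
explain extended select sum(id) from test1;       -- plan plus physical details
```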

1. Using explain

Hive provides the explain command to display a query's execution plan. The plan is very helpful for understanding underlying behavior, tuning Hive, and diagnosing data skew.

The syntax is as follows :

explain query;

Enter the following command in the hive cli (Hive 2.3.7):

explain select sum(id) from test1;

We get the following result:

STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 depends on stages: Stage-1

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: test1
            Statistics: Num rows: 6 Data size: 75 Basic stats: COMPLETE Column stats: NONE
            Select Operator
              expressions: id (type: int)
              outputColumnNames: id
              Statistics: Num rows: 6 Data size: 75 Basic stats: COMPLETE Column stats: NONE
              Group By Operator
                aggregations: sum(id)
                mode: hash
                outputColumnNames: _col0
                Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE
                Reduce Output Operator
                  sort order:
                  Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE
                  value expressions: _col0 (type: bigint)
      Reduce Operator Tree:
        Group By Operator
          aggregations: sum(VALUE._col0)
          mode: mergepartial
          outputColumnNames: _col0
          Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE
          File Output Operator
            compressed: false
            Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE
            table:
                input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink

How did the output above feel — hard to follow? Don't worry: each part is explained in detail below, and once you have worked through the explanation, reading explain results will be easy.

A Hive query is converted into a sequence of one or more stages (a directed acyclic graph, DAG). These stages can be MapReduce stages, stages responsible for metadata operations, or stages responsible for file system operations (such as move and rename).

Let's break the result above apart, starting from the outermost level, where there are two major parts:

  1. STAGE DEPENDENCIES: the dependencies between the stages
  2. STAGE PLANS: the execution plan of each stage

Starting with the first part, STAGE DEPENDENCIES: it contains two stages. Stage-1 is a root stage, i.e. the starting stage; Stage-0 depends on Stage-1, so Stage-0 executes only after Stage-1 completes.

Now the second part, STAGE PLANS: it contains one Map Reduce stage, and an MR execution plan has two parts:

  1. Map Operator Tree: the execution plan tree on the map side
  2. Reduce Operator Tree: the execution plan tree on the reduce side

These two plan trees contain the operators of the SQL statement:

  1. TableScan: table scan. The first operation on the map side is always loading the table, hence the table scan. Common properties:
    • alias: the table name (or alias)
    • Statistics: table statistics, including the number of rows, data size, etc.
  2. Select Operator: projection. Common properties:
    • expressions: the selected field names and types
    • outputColumnNames: the names of the output columns
    • Statistics: table statistics, including the number of rows, data size, etc.
  3. Group By Operator: group aggregation. Common properties:
    • aggregations: the aggregate function information
    • mode: the aggregation mode — hash (hash-based aggregation, i.e. hash partitioning), partial (partial aggregation), final (final aggregation)
    • keys: the grouping fields; absent if there is no grouping
    • outputColumnNames: the output column names after aggregation
    • Statistics: statistics after group aggregation, including the number of rows, data size, etc.
  4. Reduce Output Operator: output to the reduce phase. Common properties:
    • sort order: empty means no sorting; + means one column sorted ascending; - means one column sorted descending; +- means two sort columns, the first ascending and the second descending
  5. Filter Operator: filtering. Common properties:
    • predicate: the filter condition; e.g. where id>=1 in the SQL appears here as (id >= 1)
  6. Map Join Operator: join. Common properties:
    • condition map: the join type, e.g. Inner Join 0 to 1, Left Outer Join 0 to 2
    • keys: the join condition fields
    • outputColumnNames: the fields output by the join
    • Statistics: the number of rows, data size, etc. produced by the join
  7. File Output Operator: file output. Common properties:
    • compressed: whether the output is compressed
    • table: table information, including the input and output file formats, the serde, etc.
  8. Fetch Operator: the client-side fetch operation. Common properties:
    • limit: -1 means no limit on the number of rows; any other value is the row limit
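
When an execution plan is large, it can help to pull out just the operator sequence from captured explain output. A minimal, illustrative Python sketch (the plain-text matching heuristic here is mine, not an official Hive API; the operator list follows the one above):

```python
# Minimal sketch: list the operators appearing in captured `explain` output.
KNOWN_OPERATORS = [
    "TableScan", "Select Operator", "Group By Operator",
    "Reduce Output Operator", "Filter Operator", "Map Join Operator",
    "File Output Operator", "Fetch Operator",
]

def list_operators(plan_text):
    """Return the known operator names in the order they appear in the plan."""
    found = []
    for line in plan_text.splitlines():
        name = line.strip()
        if name in KNOWN_OPERATORS:
            found.append(name)
    return found

# Excerpt from the map-side plan shown earlier in the article.
plan = """\
Map Operator Tree:
  TableScan
    alias: test1
    Select Operator
      expressions: id (type: int)
      Group By Operator
        aggregations: sum(id)
        Reduce Output Operator
          sort order:
"""
print(list_operators(plan))
# ['TableScan', 'Select Operator', 'Group By Operator', 'Reduce Output Operator']
```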

2. Usage scenarios of explain

This section describes the conveniences explain brings us in production practice and the puzzles it can resolve.

Case 1: will a join statement filter out null values?

Enter the following query in the hive cli:

select a.id,b.user_name from test1 a join test2 b on a.id=b.id;

Question: will the join above filter out rows where id is null?

Execute the following statement :

explain select a.id,b.user_name from test1 a join test2 b on a.id=b.id;

Let's look at the result (only part of the output is shown, to fit the page):

TableScan
alias: a
Statistics: Num rows: 6 Data size: 75 Basic stats: COMPLETE Column stats: NONE
Filter Operator
predicate: id is not null (type: boolean)
Statistics: Num rows: 6 Data size: 75 Basic stats: COMPLETE Column stats: NONE
Select Operator
expressions: id (type: int)
outputColumnNames: _col0
Statistics: Num rows: 6 Data size: 75 Basic stats: COMPLETE Column stats: NONE
HashTable Sink Operator
keys:
0 _col0 (type: int)
1 _col0 (type: int)
...

The line predicate: id is not null in the result shows that for this join, Hive automatically filters out rows whose join key id is null. A left join or full join, however, does not automatically filter out null values — you can verify this yourself.
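
To check this yourself, compare the plans of the two variants (same assumed tables as above). For the inner join, the map-side Filter Operator shows id is not null; for the left join, no such automatic filter appears for the left table:

```sql
-- Inner join: the plan shows `predicate: id is not null` for the join keys.
explain select a.id, b.user_name from test1 a join test2 b on a.id = b.id;

-- Left join: null join keys from the left table are kept (paired with NULL),
-- so no automatic null filter is applied to table a.
explain select a.id, b.user_name from test1 a left join test2 b on a.id = b.id;
```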

Case 2: does a group by statement sort?

Look at the SQL below:

select id,max(user_name) from test1 group by id;

Question: does the group by statement sort?

Look directly at the explain result (only part of the output is shown, to fit the page):

 TableScan
alias: test1
Statistics: Num rows: 9 Data size: 108 Basic stats: COMPLETE Column stats: NONE
Select Operator
expressions: id (type: int), user_name (type: string)
outputColumnNames: id, user_name
Statistics: Num rows: 9 Data size: 108 Basic stats: COMPLETE Column stats: NONE
Group By Operator
aggregations: max(user_name)
keys: id (type: int)
mode: hash
outputColumnNames: _col0, _col1
Statistics: Num rows: 9 Data size: 108 Basic stats: COMPLETE Column stats: NONE
Reduce Output Operator
key expressions: _col0 (type: int)
sort order: +
Map-reduce partition columns: _col0 (type: int)
Statistics: Num rows: 9 Data size: 108 Basic stats: COMPLETE Column stats: NONE
value expressions: _col1 (type: string)
...

Look at the Group By Operator: keys: id (type: int) shows that the data is grouped by id. Further down, in the Reduce Output Operator, sort order: + shows that the id field is sorted in ascending order.

Case 3: which SQL executes more efficiently?

Look at the two SQL statements below:

SELECT
a.id,
b.user_name
FROM
test1 a
JOIN test2 b ON a.id = b.id
WHERE
a.id > 2;
SELECT
a.id,
b.user_name
FROM
(SELECT * FROM test1 WHERE id > 2) a
JOIN test2 b ON a.id = b.id;

The two statements produce the same output, but which one executes more efficiently?

Some say the first is more efficient, because the second contains a subquery, and subqueries hurt performance.

Others say the second is more efficient, because filtering first reduces the number of rows entering the join, so it runs faster.

Which one is actually more efficient? Simply prefix each statement with explain and look at the execution plans!

Prefix the first statement with explain; the result is as follows:

hive (default)> explain select a.id,b.user_name from test1 a join test2 b on a.id=b.id where a.id >2;
OK
Explain
STAGE DEPENDENCIES:
Stage-4 is a root stage
Stage-3 depends on stages: Stage-4
Stage-0 depends on stages: Stage-3
STAGE PLANS:
Stage: Stage-4
Map Reduce Local Work
Alias -> Map Local Tables:
$hdt$_0:a
Fetch Operator
limit: -1
Alias -> Map Local Operator Tree:
$hdt$_0:a
TableScan
alias: a
Statistics: Num rows: 6 Data size: 75 Basic stats: COMPLETE Column stats: NONE
Filter Operator
predicate: (id > 2) (type: boolean)
Statistics: Num rows: 2 Data size: 25 Basic stats: COMPLETE Column stats: NONE
Select Operator
expressions: id (type: int)
outputColumnNames: _col0
Statistics: Num rows: 2 Data size: 25 Basic stats: COMPLETE Column stats: NONE
HashTable Sink Operator
keys:
0 _col0 (type: int)
1 _col0 (type: int)
Stage: Stage-3
Map Reduce
Map Operator Tree:
TableScan
alias: b
Statistics: Num rows: 6 Data size: 75 Basic stats: COMPLETE Column stats: NONE
Filter Operator
predicate: (id > 2) (type: boolean)
Statistics: Num rows: 2 Data size: 25 Basic stats: COMPLETE Column stats: NONE
Select Operator
expressions: id (type: int), user_name (type: string)
outputColumnNames: _col0, _col1
Statistics: Num rows: 2 Data size: 25 Basic stats: COMPLETE Column stats: NONE
Map Join Operator
condition map:
Inner Join 0 to 1
keys:
0 _col0 (type: int)
1 _col0 (type: int)
outputColumnNames: _col0, _col2
Statistics: Num rows: 2 Data size: 27 Basic stats: COMPLETE Column stats: NONE
Select Operator
expressions: _col0 (type: int), _col2 (type: string)
outputColumnNames: _col0, _col1
Statistics: Num rows: 2 Data size: 27 Basic stats: COMPLETE Column stats: NONE
File Output Operator
compressed: false
Statistics: Num rows: 2 Data size: 27 Basic stats: COMPLETE Column stats: NONE
table:
input format: org.apache.hadoop.mapred.SequenceFileInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
Local Work:
Map Reduce Local Work
Stage: Stage-0
Fetch Operator
limit: -1
Processor Tree:
ListSink

Prefix the second statement with explain; the result is as follows:

hive (default)> explain select a.id,b.user_name from(select * from test1 where id>2 ) a join test2 b on a.id=b.id;
OK
Explain
STAGE DEPENDENCIES:
Stage-4 is a root stage
Stage-3 depends on stages: Stage-4
Stage-0 depends on stages: Stage-3
STAGE PLANS:
Stage: Stage-4
Map Reduce Local Work
Alias -> Map Local Tables:
$hdt$_0:test1
Fetch Operator
limit: -1
Alias -> Map Local Operator Tree:
$hdt$_0:test1
TableScan
alias: test1
Statistics: Num rows: 6 Data size: 75 Basic stats: COMPLETE Column stats: NONE
Filter Operator
predicate: (id > 2) (type: boolean)
Statistics: Num rows: 2 Data size: 25 Basic stats: COMPLETE Column stats: NONE
Select Operator
expressions: id (type: int)
outputColumnNames: _col0
Statistics: Num rows: 2 Data size: 25 Basic stats: COMPLETE Column stats: NONE
HashTable Sink Operator
keys:
0 _col0 (type: int)
1 _col0 (type: int)
Stage: Stage-3
Map Reduce
Map Operator Tree:
TableScan
alias: b
Statistics: Num rows: 6 Data size: 75 Basic stats: COMPLETE Column stats: NONE
Filter Operator
predicate: (id > 2) (type: boolean)
Statistics: Num rows: 2 Data size: 25 Basic stats: COMPLETE Column stats: NONE
Select Operator
expressions: id (type: int), user_name (type: string)
outputColumnNames: _col0, _col1
Statistics: Num rows: 2 Data size: 25 Basic stats: COMPLETE Column stats: NONE
Map Join Operator
condition map:
Inner Join 0 to 1
keys:
0 _col0 (type: int)
1 _col0 (type: int)
outputColumnNames: _col0, _col2
Statistics: Num rows: 2 Data size: 27 Basic stats: COMPLETE Column stats: NONE
Select Operator
expressions: _col0 (type: int), _col2 (type: string)
outputColumnNames: _col0, _col1
Statistics: Num rows: 2 Data size: 27 Basic stats: COMPLETE Column stats: NONE
File Output Operator
compressed: false
Statistics: Num rows: 2 Data size: 27 Basic stats: COMPLETE Column stats: NONE
table:
input format: org.apache.hadoop.mapred.SequenceFileInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
Local Work:
Map Reduce Local Work
Stage: Stage-0
Fetch Operator
limit: -1
Processor Tree:
ListSink

What do you find? Apart from the table aliases, the two execution plans are exactly the same: in both, the where filter is applied first, and the join is performed afterwards. This shows that Hive automatically optimizes the query for us under the hood, so the two SQL statements execute with exactly the same efficiency.
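
The rewrite at work here is predicate pushdown, which Hive enables by default through the hive.optimize.ppd setting. The snippet below is only an illustration of where the behavior comes from, not something you normally need to set:

```sql
-- Predicate pushdown is on by default; if it were disabled, the two plans
-- in this case could genuinely differ.
set hive.optimize.ppd=true;
explain select a.id, b.user_name
from test1 a join test2 b on a.id = b.id
where a.id > 2;
```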

The above are just three examples of situations that come up in production and cause confusion. explain has many other uses — inspecting stage dependencies, diagnosing data skew, tuning Hive, and so on — which you can explore yourself.

3. Using explain dependency

explain dependency describes the data sources a piece of SQL needs. Its output is JSON-formatted data containing the following two parts:

  • input_partitions: the partitions of the source tables that the SQL depends on, as a list of partition names. If none of the tables involved in the SQL are partitioned, the list is empty.

  • input_tables: the source tables the SQL depends on, as a list of Hive table names.

Use explain dependency to inspect a SQL query on a non-partitioned ordinary table. Enter the following in the hive cli:

explain dependency select s_age,count(1) num from student_orc;

The result:

{"input_partitions":[],"input_tables":[{"tablename":"default@student_tb_orc","tabletype":"MANAGED_TABLE"}]}

Use explain dependency to inspect a SQL query on a partitioned table. Enter the following in the hive cli:

explain dependency select s_age,count(1) num from student_orc_partition;

The result:

{"input_partitions":[{"partitionName":"default@student_orc_partition@part=0"},
{"partitionName":"default@student_orc_partition@part=1"},
{"partitionName":"default@student_orc_partition@part=2"},
{"partitionName":"default@student_orc_partition@part=3"},
{"partitionName":"default@student_orc_partition@part=4"},
{"partitionName":"default@student_orc_partition@part=5"},
{"partitionName":"default@student_orc_partition@part=6"},
{"partitionName":"default@student_orc_partition@part=7"},
{"partitionName":"default@student_orc_partition@part=8"},
{"partitionName":"default@student_orc_partition@part=9"}],
"input_tables":[{"tablename":"default@student_orc_partition","tabletype":"MANAGED_TABLE"}]}
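
Because explain dependency emits JSON, its result is easy to consume from a script, for example to check that a job reads exactly the partitions you expect. An illustrative Python sketch (the database@table@partition name format is an assumption here, matching the shape of the output above):

```python
import json

# Illustrative: parse `explain dependency` output and summarize what is read.
raw = ('{"input_partitions":[{"partitionName":"default@student_orc_partition@part=0"},'
       '{"partitionName":"default@student_orc_partition@part=1"}],'
       '"input_tables":[{"tablename":"default@student_orc_partition",'
       '"tabletype":"MANAGED_TABLE"}]}')

dep = json.loads(raw)
tables = [t["tablename"] for t in dep["input_tables"]]
partitions = [p["partitionName"] for p in dep["input_partitions"]]
print(tables)           # ['default@student_orc_partition']
print(len(partitions))  # 2
```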

There are two typical usage scenarios for explain dependency:

  • Scenario 1: quick triage — quickly ruling out abnormal task output caused by a failure to read data from the expected partitions. For example, in a job partitioned by day, an upstream task may fail or overrun due to factors beyond the production process's control, causing the downstream task to throw an exception. explain dependency lets you quickly check whether the partitions the SQL reads are abnormal.

  • Scenario 2: mapping out the inputs of a query to help understand how the program runs — especially the dependent inputs of subqueries and multi-table joins.

Below are two practical applications of explain dependency:

Case 1: identifying seemingly equivalent code

Programmers new to SQL can easily assume that

select * from a inner join b on a.no=b.no and a.f>1 and a.f<3;

is equivalent to

select * from a inner join b on a.no=b.no where a.f>1 and a.f<3;

An example shows the difference:

Code 1

select
a.s_no
from student_orc_partition a
inner join
student_orc_partition_only b
on a.s_no=b.s_no and a.part=b.part and a.part>=1 and a.part<=2;

Code 2

select
a.s_no
from student_orc_partition a
inner join
student_orc_partition_only b
on a.s_no=b.s_no and a.part=b.part
where a.part>=1 and a.part<=2;

Let's look at the explain dependency output of these two snippets:

explain dependency result for code 1:

{"input_partitions":
[{"partitionName":"default@student_orc_partition@part=0"},
{"partitionName":"default@student_orc_partition@part=1"},
{"partitionName":"default@student_orc_partition@part=2"},
{"partitionName":"default@student_orc_partition_only@part=1"},
{"partitionName":"default@student_orc_partition_only@part=2"}],
"input_tables": [{"tablename":"default@student_orc_partition","tabletype":"MANAGED_TABLE"}, {"tablename":"default@student_orc_partition_only","tabletype":"MANAGED_TABLE"}]}

explain dependency result for code 2:

{"input_partitions":
[{"partitionName":"default@student_orc_partition@part=1"},
{"partitionName":"default@student_orc_partition@part=2"},
{"partitionName":"default@student_orc_partition_only@part=1"},
{"partitionName":"default@student_orc_partition_only@part=2"}],
"input_tables": [{"tablename":"default@student_orc_partition","tabletype":"MANAGED_TABLE"}, {"tablename":"default@student_orc_partition_only","tabletype":"MANAGED_TABLE"}]}

From the output above we can see that the two SQL statements are in fact not equivalent. In code 1, where the non-equi filter conditions are placed in the inner join's on clause, the joined tables are not filtered as the conditions suggest, and the join reads extra data from partition part=0 during execution. In code 2, the unqualified partitions are filtered out.

Case 2: distinguishing differences in the range of data a SQL reads

Code 1

explain dependency
select
a.s_no
from student_orc_partition a
left join
student_orc_partition_only b
on a.s_no=b.s_no and a.part=b.part and b.part>=1 and b.part<=2;

Code 2

explain dependency
select
a.s_no
from student_orc_partition a
left join
student_orc_partition_only b
on a.s_no=b.s_no and a.part=b.part and a.part>=1 and a.part<=2;

Do the two snippets above read the same range of data? No — let's confirm with explain dependency:

explain dependency result for code 1:

{"input_partitions":
[{"partitionName":"default@student_orc_partition@part=0"},
{"partitionName":"default@student_orc_partition@part=1"}, … (7 partitions omitted) …
{"partitionName":"default@student_orc_partition@part=9"},
{"partitionName":"default@student_orc_partition_only@part=1"},
{"partitionName":"default@student_orc_partition_only@part=2"}],
"input_tables": [{"tablename":"default@student_orc_partition","tabletype":"MANAGED_TABLE"}, {"tablename":"default@student_orc_partition_only","tabletype":"MANAGED_TABLE"}]}

explain dependency result for code 2:

{"input_partitions":
[{"partitionName":"default@student_orc_partition@part=0"},
{"partitionName":"default@student_orc_partition@part=1"}, … (7 partitions omitted) …
{"partitionName":"default@student_orc_partition@part=9"},
{"partitionName":"default@student_orc_partition_only@part=0"},
{"partitionName":"default@student_orc_partition_only@part=1"}, … (7 partitions omitted) …
{"partitionName":"default@student_orc_partition_only@part=9"}],
"input_tables": [{"tablename":"default@student_orc_partition","tabletype":"MANAGED_TABLE"}, {"tablename":"default@student_orc_partition_only","tabletype":"MANAGED_TABLE"}]}

As you can see, when a non-equi filter condition is added to the join condition of a left outer join: if the condition applies to the right table (table b) it does take effect as a filter, so the right table scans only two partitions while the left table (table a) is scanned in full; if the condition applies to the left table, it does not filter at all, and both tables are scanned in full. In that case the behavior is the same as a full outer join, where both tables must be fully scanned.

In practice it is easy to assume that code 2 filters data just as code 1 does; the explain dependency output shows that this is not the case.

4. Using explain authorization

With explain authorization we can see the data sources (INPUTS) and data outputs (OUTPUTS) of the current SQL, as well as the current Hive user (CURRENT_USER) and operation (OPERATION).

Enter the following command in the hive cli:

explain authorization
select variance(s_score) from student_tb_orc;

The result is as follows:

INPUTS:
default@student_tb_orc
OUTPUTS:
hdfs://node01:8020/tmp/hive/hdfs/cbf182a5-8258-4157-9194-90f1475a3ed5/-mr-10000
CURRENT_USER:
hdfs
OPERATION:
QUERY
AUTHORIZATION_FAILURES:
No privilege 'Select' found for inputs { database:default, table:student_tb_orc, columnName:s_score}

From the information above:

the data source of this query is the student_tb_orc table in the default database;

the output path of the data is hdfs://node01:8020/tmp/hive/hdfs/cbf182a5-8258-4157-9194-90f1475a3ed5/-mr-10000;

the current user is hdfs, and the operation is a query;

Looking further at the output, there is an AUTHORIZATION_FAILURES message saying the current user has no Select privilege on the input — yet the SQL above still runs normally. Why? By default, when no permission management is configured, Hive does not check permissions at all: every user is effectively a super administrator in Hive and can query anything even without being granted privileges.
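
If you do want Hive to enforce permissions, an authorization mode (for example SQL-standard-based authorization) must be enabled in hive-site.xml first. With it enabled, privileges have to be granted explicitly — a sketch reusing the table and user from the output above:

```sql
-- Assumes SQL-standard-based authorization has been enabled in hive-site.xml
-- (hive.security.authorization.enabled=true); table and user names are taken
-- from the example output above.
grant select on table student_tb_orc to user hdfs;
show grant user hdfs on table student_tb_orc;
```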

Conclusion

From this introduction to explain you can see that there is a lot in it worth studying. Learning to read explain execution plans will help us optimize Hive SQL and strengthen our command of SQL.
