Analytics/Cluster/Systems/Hive/Avro

From Wikitech

This describes how to do evolution of avro schemas in hive.

Intro

Despite of what you might read here: https://cwiki.apache.org/confluence/display/Hive/AvroSerDe we have found that avro in hive has several gotchas. Thus we are documenting what has worked for us.


Create tables with avro schema

We have found that easies is to use the avro.schema.literal. property.

Example of how to create a table (if schema is too large you might need to remove all whitespace)

CREATE TABLE nuria_avro
  ROW FORMAT SERDE
  'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
  STORED as INPUTFORMAT
  'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
  OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
  TBLPROPERTIES (
'avro.schema.literal'='
{
    "type": "record",
     "namespace": "org.wikimedia.analytics.schemas",
    "name": "testTable",
    "fields": [
        {"name": "userName",        "type": "string", "default":""},
        {"name": "favouriteNumber", "type": "int", "default":1},
        {"name": "interests",       "type": {"type": "array", "items": "string"}, "default":[]}
    ]
}');


Update schema

Updating schema requires updating the TBLPROPERTY. Please note that optional fields in avro have to be defined a certain way using unions, see specifications for details: [1]


ALTER TABLE nuria_avro SET TBLPROPERTIES ('avro.schema.literal'='
{
    "type": "record",
    "namespace": "org.wikimedia.analytics.schemas",
    "name": "testTable",
    "fields": [
        {"name": "userName",        "type": "string", "default":""},
        {"name":"lastName",         "type":["string", "null"], "default":""},
        {"name": "favouriteNumber", "type": "int", "default":1},
        {"name": "interests",       "type": {"type": "array", "items": "string"}, "default":[]}
    ]
}');

External Tables

When interacting with data that is not created within hive, an external table needs to be created to point at that data.

CREATE EXTERNAL TABLE `avro_test`
PARTITIONED BY ( 
  `year` string, 
  `month` string, 
  `day` string, 
  `hour` string)
ROW FORMAT SERDE 
  'org.apache.hadoop.hive.serde2.avro.AvroSerDe' 
STORED AS INPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION
  'hdfs://analytics-hadoop/wmf/data/raw/mediawiki/mediawiki_testTable/hourly'
TBLPROPERTIES (
  'avro.schema.literal'='{
    "type": "record",
    "namespace": "org.wikimedia.analytics.schemas",
    "name": "testTable",
    "fields": [
        {"name": "userName",        "type": "string", "default":""},
        {"name":"lastName",         "type":["string", "null"], "default":""},
        {"name": "favouriteNumber", "type": "int", "default":1}
    ]
  }'
)

Mediawiki and Avro

Analytics/Cluster/MediaWiki_Avro_Logging