Json to Avro Conversion
When an Airbyte data stream is synced to the Avro or Parquet format (e.g. Parquet on S3), the source Json schema is converted to an Avro schema, then the Json object is converted to an Avro record based on the Avro schema (and further to Parquet if necessary). Because the data stream can come from any data source, the Json to Avro conversion process has the following rules and limitations.
  1. 1.
    Json schema types are mapped to Avro types as follows:
    Json Data Type
    Avro Data Type
    string
    string
    number
    double
    integer
    int
    boolean
    boolean
    null
    null
    object
    record
    array
    array
  2. 2.
    Built-in Json schema date-time formats will be mapped to Avro logical types
Date
The date logical type represents a date within the calendar, with no reference to a particular time zone or time of day.
A date logical type annotates an Avro int, where the int stores the number of days from the unix epoch, 1 January 1970 (ISO calendar).
1
{
2
"type": "string",
3
"format": "date"
4
}
Copied!
will become in Avro schema:
1
{
2
"type": "int",
3
"logicalType": "date"
4
}
Copied!
Time (microsecond precision)
The time-micros logical type represents a time of day, with no reference to a particular calendar, time zone or date, with a precision of one microsecond.
A time-micros logical type annotates an Avro long, where the long stores the number of microseconds after midnight, 00:00:00.000000.
1
{
2
"type": "string",
3
"format": "time"
4
}
Copied!
will become in Avro schema:
1
{
2
"type": "long",
3
"logicalType": "time-micros"
4
}
Copied!
Timestamp (microsecond precision)
The timestamp-micros logical type represents an instant on the global timeline, independent of a particular time zone or calendar, with a precision of one microsecond.
A timestamp-micros logical type annotates an Avro long, where the long stores the number of microseconds from the unix epoch, 1 January 1970 00:00:00.000000 UTC.
1
{
2
"type": "string",
3
"format": "date-time"
4
}
Copied!
will become in Avro schema:
1
{
2
"type": "long",
3
"logicalType": "timestamp-micros"
4
}
Copied!
  1. 1.
    Combined restrictions ("allOf", "anyOf", and "oneOf") will be converted to type unions. The corresponding Avro schema can be less stringent. For example, the following Json schema
    1
    {
    2
    "oneOf": [
    3
    { "type": "string" },
    4
    { "type": "integer" }
    5
    ]
    6
    }
    Copied!
    will become this in Avro schema:
    1
    {
    2
    "type": ["null", "string", "int"]
    3
    }
    Copied!
  2. 2.
    Keyword not is not supported, as there is no equivalent validation mechanism in Avro schema.
  3. 3.
    Only alphanumeric characters and underscores (/a-zA-Z0-9_/) are allowed in a stream or field name. Any special character will be converted to an alphabet or underscore. For example, spécial:character_names will become special_character_names. The original names will be stored in the doc property in this format: _airbyte_original_name:<original-name>.
  4. 4.
    The field name cannot start with a number, so an underscore will be added to the field name at the beginning.
  5. 5.
    All field will be nullable. For example, a string Json field will be typed as ["null", "string"] in Avro. This is necessary because the incoming data stream may have optional fields.
  6. 6.
    For array fields in Json schema, when the items property is an array, it means that each element in the array should follow its own schema sequentially. For example, the following specification means the first item in the array should be a string, and the second a number.
    1
    {
    2
    "array_field": {
    3
    "type": "array",
    4
    "items": [
    5
    { "type": "string" },
    6
    { "type": "number" }
    7
    ]
    8
    }
    9
    }
    Copied!
    This is not supported in Avro schema. As a compromise, the converter creates a union, ["string", "number"], which is less stringent:
    1
    {
    2
    "name": "array_field",
    3
    "type": [
    4
    "null",
    5
    {
    6
    "type": "array",
    7
    "items": ["null", "string"]
    8
    }
    9
    ],
    10
    "default": null
    11
    }
    Copied!
  7. 7.
    Three Airbyte specific fields will be added to each Avro record:
    Field
    Schema
    Document
    _airbyte_ab_id
    uuid
    link
    _airbyte_emitted_at
    timestamp-millis
    link
    _airbyte_additional_properties
    map of string
    See explanation below.
  8. 8.
    A Json object can have additional properties of unknown types, which is not compatible with the Avro schema. To solve this problem during Json to Avro object conversion, we introduce a special field: _airbyte_additional_properties typed as a nullable map from string to string:
1
{
2
"name": "_airbyte_additional_properties",
3
"type": ["null", { "type": "map", "values": "string" }],
4
"default": null
5
}
Copied!
For example, given the following Json schema:
1
{
2
"type": "object",
3
"$schema": "http://json-schema.org/draft-07/schema#",
4
"properties": {
5
"username": {
6
"type": ["null", "string"]
7
}
8
}
9
}
Copied!
this Json object
1
{
2
"username": "admin",
3
"active": true,
4
"age": 21,
5
"auth": {
6
"auth_type": "ssl",
7
"api_key": "abcdefg/012345",
8
"admin": false,
9
"id": 1000
10
}
11
}
Copied!
will be converted to the following Avro object:
1
{
2
"username": "admin",
3
"_airbyte_additional_properties": {
4
"active": "true",
5
"age": "21",
6
"auth": "{\"auth_type\":\"ssl\",\"api_key\":\"abcdefg/012345\",\"admin\":false,\"id\":1000}"
7
}
8
}
Copied!
Note that all fields other than the username is moved under _ab_additional_properties as serialized strings, including the original object auth.
  1. 1.
    Based on the above rules, here is an overall example. Given the following Json schema:
1
{
2
"type": "object",
3
"$schema": "http://json-schema.org/draft-07/schema#",
4
"properties": {
5
"id": {
6
"type": "integer"
7
},
8
"user": {
9
"type": ["null", "object"],
10
"properties": {
11
"id": {
12
"type": "integer"
13
},
14
"field_with_spécial_character": {
15
"type": "integer"
16
}
17
}
18
},
19
"created_at": {
20
"type": ["null", "string"],
21
"format": "date-time"
22
}
23
}
24
}
Copied!
Its corresponding Avro schema will be:
1
{
2
"name": "stream_name",
3
"type": "record",
4
"fields": [
5
{
6
"name": "_airbyte_ab_id",
7
"type": {
8
"type": "string",
9
"logicalType": "uuid"
10
}
11
},
12
{
13
"name": "_airbyte_emitted_at",
14
"type": {
15
"type": "long",
16
"logicalType": "timestamp-millis"
17
}
18
},
19
{
20
"name": "id",
21
"type": ["null", "int"],
22
"default": null
23
},
24
{
25
"name": "user",
26
"type": [
27
"null",
28
{
29
"type": "record",
30
"name": "user",
31
"fields": [
32
{
33
"name": "id",
34
"type": ["null", "int"],
35
"default": null
36
},
37
{
38
"name": "field_with_special_character",
39
"type": ["null", "int"],
40
"doc": "_airbyte_original_name:field_with_spécial_character",
41
"default": null
42
},
43
{
44
"name": "_airbyte_additional_properties",
45
"type": ["null", { "type": "map", "values": "string" }],
46
"default": null
47
}
48
]
49
}
50
],
51
"default": null
52
},
53
{
54
"name": "created_at",
55
"type": ["null", "string"],
56
"default": null
57
},
58
{
59
"name": "_airbyte_additional_properties",
60
"type": ["null", { "type": "map", "values": "string" }],
61
"default": null
62
}
63
]
64
}
Copied!
Last modified 2d ago
Copy link
Edit on GitHub