Starting Hive
Configuring Your Hadoop Environment
Local Mode Configuration
Distributed and Pseudodistributed Mode Configuration
Metastore
Using JDBC
The Hive Command
Command Options
The Command-Line Interface
CLI Options
Variables and Properties
Hive "One Shot" Commands
Executing Hive Queries from Files
The hiverc File
More on Using the Hive CLI
Command History
Shell Execution
Hadoop dfs Commands from Inside Hive
Comments in Hive Scripts
Query Column Headers
Data Types and File Formats
Primitive Data Types
Collection Data Types
Text File Encoding of Data Values
Schema on Read
HiveQL: Data Definition
Databases in Hive
Alter Database
Creating Tables
Managed Tables
External Tables
Partitioned, Managed Tables
External Partitioned Tables
Customizing Table Storage Formats
Dropping Tables
Alter Table
Renaming a Table
Adding, Modifying, and Dropping a Table Partition
Changing Columns
Adding Columns
Deleting or Replacing Columns
Alter Table Properties
Alter Storage Properties
Miscellaneous Alter Table Statements
HiveQL: Data Manipulation
Loading Data into Managed Tables
Inserting Data into Tables from Queries
Dynamic Partition Inserts
Creating Tables and Loading Them in One Query
Exporting Data
HiveQL: Queries
SELECT … FROM Clauses
Specify Columns with Regular Expressions
Computing with Column Values
Arithmetic Operators
Using Functions
LIMIT Clause
Column Aliases
Nested SELECT Statements
CASE … WHEN … THEN Statements
When Hive Can Avoid MapReduce
WHERE Clauses
Predicate Operators
Gotchas with Floating-Point Comparisons
LIKE and RLIKE
GROUP BY Clauses
HAVING Clauses
JOIN Statements
Inner JOIN
Join Optimizations
LEFT OUTER JOIN
OUTER JOIN Gotcha
RIGHT OUTER JOIN
FULL OUTER JOIN
LEFT SEMI-JOIN
Cartesian Product JOINs
Map-side Joins
ORDER BY and SORT BY
DISTRIBUTE BY with SORT BY
CLUSTER BY
Casting
Casting BINARY Values
Queries that Sample Data
Block Sampling
Input Pruning for Bucket Tables
UNION ALL
HiveQL: Views
Views to Reduce Query Complexity
Views that Restrict Data Based on Conditions
Views and Map Type for Dynamic Tables
View Odds and Ends
HiveQL: Indexes
Creating an Index
Bitmap Indexes
Rebuilding the Index
Showing an Index
Dropping an Index
Implementing a Custom Index Handler
Schema Design
Table-by-Day
Over Partitioning
Unique Keys and Normalization
Making Multiple Passes over the Same Data
The Case for Partitioning Every Table
Bucketing Table Data Storage
Adding Columns to a Table
Using Columnar Tables
Repeated Data
Many Columns
(Almost) Always Use Compression!
Tuning
Using EXPLAIN
EXPLAIN EXTENDED
Limit Tuning
Optimized Joins
Local Mode
Parallel Execution
Strict Mode
Tuning the Number of Mappers and Reducers
JVM Reuse
Indexes
Dynamic Partition Tuning
Speculative Execution
Single MapReduce Multi-GROUP BY
Virtual Columns
Other File Formats and Compression
Determining Installed Codecs
Choosing a Compression Codec
Enabling Intermediate Compression
Final Output Compression
Sequence Files
Compression in Action
Archive Partition
Compression: Wrapping Up
Functions
A UDF for Finding a Zodiac Sign from a Day
UDF Versus GenericUDF
Permanent Functions
User-Defined Aggregate Functions
Creating a COLLECT UDAF to Emulate GROUP_CONCAT
User-Defined Table Generating Functions
UDTFs that Produce Multiple Rows
UDTFs that Produce a Single Row with Multiple Columns
UDTFs that Simulate Complex Types
Accessing the Distributed Cache from a UDF
Annotations for Use with Functions
Deterministic
Stateful
DistinctLike
Macros
Streaming
Identity Transformation
Changing Types
Projecting Transformation
Manipulative Transformations
Using the Distributed Cache
Producing Multiple Rows from a Single Row
Calculating Aggregates with Streaming
CLUSTER BY, DISTRIBUTE BY, SORT BY
GenericMR Tools for Streaming to Java
Calculating Cogroups
Customizing Hive File and Record Formats
File Versus Record Formats

Edward Capriolo is currently System Administrator at Media6degrees, where he helps design and maintain distributed data storage systems for the Internet advertising industry. Edward is a member of the Apache Software Foundation and a committer for the Hadoop-Hive project. He has experience as a developer, as well as a Linux and network administrator, and enjoys the rich world of open source software.

Dean Wampler is a Principal Consultant at Think Big Analytics, where he specializes in "Big Data" problems and tools like Hadoop and Machine Learning. Besides Big Data, he specializes in Scala, the JVM ecosystem, JavaScript, Ruby, functional and object-oriented programming, and Agile methods. Dean is a frequent speaker at industry and academic conferences on these topics. He has a Ph.D. in Physics from the University of Washington.

Jason Rutherglen is a software architect at Think Big Analytics and specializes in Big Data, Hadoop, search, and security. 