The hustle module contains everything a client application typically needs to create Hustle Tables, insert data into them, and query them.
Hustle Tables are stored in multiple LMDB memory-mapped files, which are replicated into Disco’s Distributed File System (DDFS). Queries are written in a Python DSL that is translated dynamically into a Disco pipelined map/reduce job.
Hustle is ideal for low latency OLAP queries over massive data sets.
The fundamental data type supporting Hustle’s relational model. A Table contains a number of named Columns, each of which is decorated with schema information. Note that the table is stored in DDFS as a series of replicated sub-database files called Marbles, each of which is an LMDB memory-mapped B+ tree file. Each Marble contains the data for the rows and columns of the Table.
Normally, a table is created using create(), which creates the appropriately named DDFS tag and attributes. To instantiate an existing Table (to use in a query, for example), the from_tag() method is used.
See Hustle Schema Design Guide for a detailed look at Hustle’s schema design language and features.
Instantiate a named Table based on meta data from a DDFS tag.
Parameters: name (string) – the name of the table
Create a new Table, replacing an existing table if force=True.
If columns is set, the fields parameter is ignored.
Example:
pixels = Table.create('pixels',
                      columns=['index string token', 'index uint8 isActive', 'index site_id', 'uint32 amount',
                               'index int32 account_id', 'index city', 'index trie16 state', 'index int16 metro',
                               'string ip', 'lz4 keyword', 'index string date'],
                      partition='date',
                      force=True)
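The column strings in the example above appear to follow a small grammar: an optional index keyword, an optional type, then the column name. The sketch below is a plain-Python illustration of that reading. The helper parse_column_spec is hypothetical and is not part of Hustle, and the grammar is inferred from the example only, so treat it as an assumption rather than Hustle's actual parser.

```python
def parse_column_spec(spec):
    """Split a column spec such as 'index uint8 isActive' into its parts.

    Assumed grammar (inferred from the example above, not taken from
    Hustle's source): an optional 'index' keyword, an optional type,
    then the column name.
    """
    tokens = spec.split()
    if not tokens:
        raise ValueError("empty column spec")
    name = tokens[-1]                  # the column name comes last
    modifiers = tokens[:-1]
    indexed = "index" in modifiers
    types = [t for t in modifiers if t != "index"]
    if len(types) > 1:
        raise ValueError("more than one type in %r" % spec)
    col_type = types[0] if types else None   # None: no type was written
    return {"name": name, "indexed": indexed, "type": col_type}


# Hypothetical subset of the columns used in the example above.
columns = ['index string token', 'index uint8 isActive', 'index site_id',
           'uint32 amount', 'lz4 keyword']
for spec in columns:
    print(parse_column_spec(spec))
```

The sketch reports a missing type as None instead of guessing which default Hustle applies; the Hustle Schema Design Guide is the place to check that.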
Warning
This function will not delete or update existing data in any way. If you use force=True to change the schema, make sure you either keep the change backward compatible (by only adding new columns) or delete and reload your data.
See also
For a good example of creating a partitioned Hustle database, see Hustle Integration Test Suite. For detailed schema design docs, see Hustle Schema Design Guide.
Insert data into a Hustle Table.
Create a Marble file from the input file or streams, according to the schema of the table. Push the resulting file(s) into DDFS under the appropriate (possibly partitioned) DDFS tags.
Note that a call to insert() may actually create and push more than one file, depending on how many partition values exist in the input. Be careful.
For a good example of inserting into a partitioned Hustle database, see Inserting Data To Hustle.
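To see why a single insert() can produce several files, it may help to picture the input being bucketed by partition value, with one Marble per bucket. The sketch below is plain Python and does not call Hustle. The helper group_by_partition and the sample records are hypothetical, and the one-file-per-partition-value rule is taken from the note above, not from Hustle's source.

```python
from collections import defaultdict


def group_by_partition(records, partition):
    """Bucket records by the value of their partition column."""
    buckets = defaultdict(list)
    for record in records:
        buckets[record[partition]].append(record)
    return dict(buckets)


# Hypothetical input records; 'date' plays the role of the partition column.
records = [
    {"date": "2014-01-27", "site_id": "a.com", "amount": 3},
    {"date": "2014-01-27", "site_id": "b.com", "amount": 5},
    {"date": "2014-01-28", "site_id": "a.com", "amount": 1},
]

buckets = group_by_partition(records, "date")
# Two distinct dates, so this input would yield two files under this model.
print(len(buckets))
```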
Perform a relational query, by selecting rows and columns from one or more tables.
The return value is one of:
* an iterator over the resulting tuples when nest==False
* a Table (hustle.Table) instance when nest==True
* None when nest==False and dump==True (this is the default CLI interaction)
For all of the examples below, imps and pix are instances of Table.
Returns a Future object.
Return an aggregation for the sum of the given column. Like the SQL sum() function. This is used in hustle.select() calls to specify the sum aggregation over a column in a query:
select(h_sum(employee.salary), employee.department, where=employee.age > 25)
returns the total salary of each department’s employees who are over 25 years old.
Parameters: col (hustle.core.marble.Column) – the column to aggregate
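The grouping behaviour described above (one total per department, computed only over rows that pass the where filter) can be sketched in plain Python. This is not how Hustle executes the query, which runs as a Disco job; the helper sum_by and the sample rows are hypothetical and only illustrate the semantics.

```python
from collections import defaultdict

# Hypothetical rows standing in for the employee table.
employees = [
    {"name": "Ann", "department": "Finance", "age": 41, "salary": 90},
    {"name": "Bob", "department": "Finance", "age": 24, "salary": 50},
    {"name": "Cid", "department": "Sales", "age": 33, "salary": 70},
    {"name": "Dee", "department": "Sales", "age": 29, "salary": 60},
]


def sum_by(rows, value, key, where=lambda row: True):
    """Sum `value` per distinct `key`, over the rows that satisfy `where`."""
    totals = defaultdict(int)
    for row in rows:
        if where(row):
            totals[row[key]] += row[value]
    return dict(totals)


# Mirrors: select(h_sum(employee.salary), employee.department,
#                 where=employee.age > 25)
totals = sum_by(employees, "salary", "department",
                where=lambda row: row["age"] > 25)
print(totals)
```

Bob is excluded by the age filter, so the Finance total counts Ann's salary only.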
Return an aggregation for the count of each grouped key in a query. Like the SQL count() function:
select(h_count(), employee.department, where=employee)
returns a count of the number of employees in each department.
Return an aggregation for the maximum of the given column. Like the SQL max() function:
select(h_max(employee.salary), employee.department, where=employee)
returns the highest salary for each department.
Parameters: col (hustle.core.marble.Column) – the column to aggregate
Return an aggregation for the minimum of the given column. Like the SQL min() function:
select(h_min(employee.salary), employee.department, where=employee)
returns the lowest salary in each department.
Parameters: col (hustle.core.marble.Column) – the column to aggregate
Return an aggregation for the average of the given column. Like the SQL avg() function:
select(h_avg(employee.salary), employee.department, where=employee)
returns the average salary in each department.
Parameters: col (hustle.core.marble.Column) – the column to aggregate
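The count, maximum, minimum and average aggregations described above all work per grouped key. A minimal plain-Python sketch of those semantics follows. The helper aggregate_by and the sample rows are hypothetical, and nothing here reflects how Hustle computes the values internally.

```python
from collections import defaultdict


def aggregate_by(rows, value, key):
    """Compute count, max, min and average of `value` per distinct `key`."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row[value])
    return {
        group: {
            "count": len(values),
            "max": max(values),
            "min": min(values),
            "avg": sum(values) / float(len(values)),
        }
        for group, values in groups.items()
    }


# Hypothetical rows standing in for the employee table.
salaries = [
    {"department": "Finance", "salary": 90},
    {"department": "Finance", "salary": 50},
    {"department": "Sales", "salary": 70},
]

stats = aggregate_by(salaries, "salary", "department")
print(stats["Finance"])
```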
Return the list of all columns in a table. This is used much like the * notation in SQL:
select(*star(employee), where=employee.department == 'Finance')
returns all of the columns from the employee table for the Finance department.
Parameters: col – the table to extract the column names from
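The leading * in the call above is ordinary Python argument unpacking: star() returns a list, and * spreads that list into separate positional arguments. The sketch below shows the mechanism with stand-ins. Both star and select here are hypothetical toy functions that only mimic the call shape, and the dictionary standing in for a table is an assumption.

```python
def star(table):
    """Toy stand-in: return the list of columns of a table-like dict."""
    return list(table["columns"])


def select(*project, **kwargs):
    """Toy stand-in: return the positional arguments it received."""
    return project


# Hypothetical table description; a real call would pass a Hustle Table.
employee = {"columns": ["name", "department", "salary"]}

# The list is spread into three separate positional arguments.
projected = select(*star(employee))
print(projected)

# Without the *, select would receive one argument: the whole list.
unexpanded = select(star(employee))
print(len(unexpanded))
```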
Pretty print the results of a query or table.
Return the visible Hustle tables on the currently configured DDFS server. Hustle finds tables by looking for DDFS tags that have a hustle: prefix.
Parameters: kwargs (dict) – custom settings for this query, see hustle.core.settings
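The prefix lookup described above amounts to a filter over tag names. The sketch below applies that filter to a hard-coded list so it can run without a DDFS server. The helper visible_tables and the sample tag names are hypothetical, and the sketch deliberately ignores how tags for partitioned tables are named.

```python
HUSTLE_PREFIX = "hustle:"


def visible_tables(tags):
    """Return the names of tags that carry the hustle: prefix, sorted."""
    return sorted(
        tag[len(HUSTLE_PREFIX):]
        for tag in tags
        if tag.startswith(HUSTLE_PREFIX)
    )


# Hypothetical tag listing; a real one would come from the DDFS server.
tags = ["hustle:pixels", "hustle:impressions", "disco:results:1234", "backup"]
tables = visible_tables(tags)
print(tables)
```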
Print all available tables.
Parameters: kwargs (dict) – custom settings for this query, see hustle.core.settings
Print the schema for a given table.
Parameters: kwargs (dict) – custom settings for this query, see hustle.core.settings
Get partitions for a given table.
Parameters: kwargs (dict) – custom settings for this query, see hustle.core.settings
Print the partitions for a given table.
Parameters: kwargs (dict) – custom settings for this query, see hustle.core.settings
Delete the data and partitions of a given table, but keep the table definition.
Warning
Given a table object, all partitions will be deleted. To delete only a specific range of partitions, pass a Hustle expression instead, e.g. ‘impression.date < 2014-01-01’.
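The effect of a date expression like the one above can be pictured as a filter over partition values. The sketch below is plain Python and does not call Hustle. The partition list is hypothetical, and it assumes partitions are ISO-formatted date strings, which compare correctly as plain strings.

```python
# Hypothetical partition values for a table partitioned by date.
partitions = ["2013-12-30", "2013-12-31", "2014-01-01", "2014-01-02"]
cutoff = "2014-01-01"

# ISO dates (YYYY-MM-DD) sort lexicographically in chronological order,
# so a string comparison selects the same range a date comparison would.
to_delete = [p for p in partitions if p < cutoff]
to_keep = [p for p in partitions if p >= cutoff]

print(to_delete)
print(to_keep)
```

Under this model the cutoff date itself is kept, because the comparison is strict.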
Drop all data, partitions, and table definition for a given table.
The return value of the non-blocking function select_nb().
It provides a series of functions to check a query’s status and fetch its results.
Return the status of the attached query.
Block and wait for the query to finish. If the query has already finished, the result is returned at once.
The return value is the same as that of select(). If the query is nested, it returns a table; otherwise, it returns a list of URLs.
A helper function that shows whether the query is done or not.
A helper function for a nested query to return its table result.
If the query is not nested or has not finished, an exception will be raised.
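The submit-now, collect-later pattern described above is the same one the Python 3 standard library offers in concurrent.futures. The sketch below uses that module purely as an analogy. It is not Hustle's Future class, the method names are those of the standard library and not Hustle's, and slow_query is a hypothetical stand-in for a long-running query.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def slow_query():
    """Hypothetical stand-in for a query that takes a while to run."""
    time.sleep(0.1)
    return [("Finance", 90), ("Sales", 130)]


with ThreadPoolExecutor(max_workers=1) as pool:
    # submit() returns at once, much as a non-blocking select would.
    future = pool.submit(slow_query)
    # done() is a status check that never blocks; it may still be False here.
    print(future.done())
    # result() blocks until the work has finished, then returns its value.
    rows = future.result()
    print(future.done())

print(rows)
```

Checking the status first and blocking only when the result is needed lets the caller do other work while the query runs.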