ketl.db package

Submodules

ketl.db.models module

class ketl.db.models.API(**kwargs)

Bases: sqlalchemy.orm.decl_api.Base, ketl.extractor.Rest.RestMixin

The API class is the center of the organizational model for kETL. It configures the basic logic of accessing some set of resources, setting up credentials as needed.

BATCH_SIZE = 10000
property api_hash: str

Hash the API by hashing all of its sources and return the hex digest.

Returns

Hex digest of the hash.
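
The idea of hashing an API by hashing all of its sources can be sketched as follows. This is only an illustration, not kETL's implementation: the helper name `combined_hex_digest`, the use of SHA-1, the sorting step, and the representation of each source as a hex string are all assumptions.

```python
import hashlib
from typing import Iterable


def combined_hex_digest(source_hashes: Iterable[str]) -> str:
    """Fold several per-source hex digests into a single hex digest.

    Hypothetical helper: kETL's actual algorithm and ordering may differ.
    """
    combined = hashlib.sha1()  # assumption: the real hash function is not documented
    # Sort so the result does not depend on the order in which sources are visited.
    for digest in sorted(source_hashes):
        combined.update(digest.encode("utf-8"))
    return combined.hexdigest()


digest = combined_hex_digest(["abc123", "def456"])
```

Under these assumptions the digest changes whenever any source hash changes, which is what makes it usable as a cheap "has the configuration changed?" check.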

property cached_files: sqlalchemy.orm.query.Query

Retrieve all CachedFile objects that are configured for this API and stored in the database.

Returns

A batched query object.

cached_files_on_disk(use_hash=True, missing=False, limit_ids=None) -> sqlalchemy.orm.query.Query

Retrieve a list of all cached files thought to (or known to) be on disk.

Parameters
  • use_hash – If True, treat the presence of a hash in the database as evidence that the file exists. If False, check whether the file is actually present at its path.

  • missing – Return the files that are configured but missing from disk.

  • limit_ids – Limit the result set to the specific ids supplied.

Returns

A batched query.
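
The interaction of the three parameters can be sketched over plain Python objects. This is not the real query: `FileRecord` and `files_on_disk` are hypothetical stand-ins, no database is involved, and the reading that `missing=True` inverts the selection is an assumption based on the parameter description.

```python
import tempfile
from dataclasses import dataclass
from pathlib import Path
from typing import Iterable, List, Optional


@dataclass
class FileRecord:
    """Hypothetical stand-in for a CachedFile row."""
    id: int
    path: Path
    hash: Optional[str] = None


def files_on_disk(records: Iterable[FileRecord], use_hash: bool = True,
                  missing: bool = False,
                  limit_ids: Optional[Iterable[int]] = None) -> List[FileRecord]:
    wanted = None if limit_ids is None else set(limit_ids)
    selected = []
    for record in records:
        if wanted is not None and record.id not in wanted:
            continue
        if use_hash:
            # Cheap check: a stored hash is taken as evidence of existence.
            present = record.hash is not None
        else:
            # Expensive check: actually look at the filesystem.
            present = record.path.exists()
        # Assumption: missing=True selects exactly the files judged absent.
        if present != missing:
            selected.append(record)
    return selected


with tempfile.TemporaryDirectory() as tmp:
    on_disk = Path(tmp) / "present.csv"
    on_disk.write_text("a,b\n")
    records = [
        FileRecord(id=1, path=on_disk),                             # on disk, never hashed
        FileRecord(id=2, path=Path(tmp) / "gone.csv", hash="abc"),  # hashed, not on disk
    ]
    by_hash = [r.id for r in files_on_disk(records)]
    by_path = [r.id for r in files_on_disk(records, use_hash=False)]
    missing_by_path = [r.id for r in files_on_disk(records, use_hash=False, missing=True)]
```

The two records are chosen so that the hash-based and path-based checks disagree, which shows why `use_hash=True` is described as finding files "thought to" be on disk rather than "known to" be.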

creds
description
property expected_files: sqlalchemy.orm.query.Query

Retrieve all the expected files under this API.

Returns

A list of expected files.

static get_instance(model: Type[ketl.db.models.API], name=None) -> ketl.db.models.API

Retrieve an instance of the given subclass of API. There can only be one instance per name.

Parameters
  • model – A subclass of API.

  • name – An optional name. Only one API per name is allowed.

Returns

An instance of the provided subclass of API.
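
The "one instance per name" rule is a get-or-create pattern. The sketch below shows it with an in-memory dictionary instead of a database; the `API` class here is a hypothetical stand-in, `WeatherAPI` is an invented subclass, and falling back to the class name when `name` is omitted is an assumption.

```python
from typing import Dict, Optional, Type


class API:
    """Hypothetical stand-in for ketl.db.models.API (no database involved)."""

    _instances: Dict[str, "API"] = {}

    def __init__(self, name: str):
        self.name = name

    @staticmethod
    def get_instance(model: Type["API"], name: Optional[str] = None) -> "API":
        # Assumption: default to the class name when no name is given.
        name = name or model.__name__
        instance = API._instances.get(name)
        if instance is None:
            # First request for this name: create and remember the instance.
            instance = model(name=name)
            API._instances[name] = instance
        return instance


class WeatherAPI(API):
    """Hypothetical subclass used only for illustration."""


first = API.get_instance(WeatherAPI, "weather")
second = API.get_instance(WeatherAPI, "weather")
```

Here `first` and `second` refer to the same object. In the real class the lookup presumably goes through the database session rather than a dictionary.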

hash
id
name
abstract setup()

All subclasses of API must implement the setup method to generate the actual configuration that will specify what is to be downloaded.

Returns

sources
class ketl.db.models.CachedFile(**kwargs)

Bases: sqlalchemy.orm.decl_api.Base

The CachedFile class represents a single file that may be downloaded by an extractor.

BLOCK_SIZE = 65536
cache_type
expected_files
expected_mode
extract_to
property file_hash

Return the hash of the file.

Returns

The hash object (not the digest or the hex digest!) of the file.
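
Because the property returns a hash object rather than a digest, callers have to ask for the digest themselves. The sketch below shows block-wise hashing with the documented `BLOCK_SIZE`. The helper name `hash_file` and the choice of SHA-1 are assumptions, since the documentation does not name the hash function.

```python
import hashlib
import tempfile
from pathlib import Path

BLOCK_SIZE = 65536  # matches the documented class attribute


def hash_file(path: Path, block_size: int = BLOCK_SIZE):
    """Return a hashlib hash object for the file at `path`.

    Hypothetical helper; SHA-1 is an assumption, not confirmed by the docs.
    """
    file_hash = hashlib.sha1()
    with open(path, "rb") as handle:
        # Read fixed-size blocks so large files never sit fully in memory.
        for block in iter(lambda: handle.read(block_size), b""):
            file_hash.update(block)
    return file_hash


with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.bin"
    sample.write_bytes(b"x" * (BLOCK_SIZE * 2 + 1))  # spans three blocks
    result = hash_file(sample)
    hex_digest = result.hexdigest()  # the digest must be requested explicitly
```

Returning the hash object leaves the caller free to choose between `digest()` and `hexdigest()`, or to keep updating it with further data.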

property full_path: pathlib.Path

Return the absolute path of the cached file.

Returns

The absolute path of the file.

property full_url: str
hash
id
is_archive
last_download
last_update
meta
path
preprocess(overwrite_on_extract=True) -> Optional[dict]

Preprocess the file, extracting and creating expected files as needed.

Returns

Optionally returns an expected file, if one was created directly from the cached file. Otherwise returns None.

refresh_interval
size
source
source_id
url
url_params
class ketl.db.models.Creds(**kwargs)

Bases: sqlalchemy.orm.decl_api.Base

A simple class for keeping track of credentials. Details are stored in a JSON blob.

SECURITY WARNING: creds are currently stored unencrypted. Don’t put anything in here that requires real security.

api_config
api_config_id
creds_details
id
class ketl.db.models.ExpectedFile(**kwargs)

Bases: sqlalchemy.orm.decl_api.Base

A class representing the files that are expected to actually be processed.

BLOCK_SIZE = 65536
archive_path
cached_file
cached_file_id
property file_hash

Hash the expected file.

Returns

The hash object.

file_type
hash
id
last_processed
meta
path
processed
size
class ketl.db.models.ExpectedMode(value)

Bases: enum.Enum

An enum representing the various ways of generating ExpectedFile objects from CachedFile objects.

auto = 'auto'
explicit = 'explicit'
self = 'self'
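
Since each member's value is its own name as a string, a mode read from configuration can be turned into a member by value lookup. The class below is a local copy of the documented members for illustration only; it is not imported from kETL.

```python
from enum import Enum


class ExpectedMode(Enum):
    """Local copy of the documented members, for illustration only."""
    auto = "auto"
    explicit = "explicit"
    self = "self"


mode = ExpectedMode("explicit")          # look up a member by its value
names = [m.name for m in ExpectedMode]   # iteration follows definition order
```

Passing a string that is not one of the three values raises `ValueError`, which is standard `enum.Enum` behaviour.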
exception ketl.db.models.InvalidConfigurationError

Bases: Exception

Exception indicating an invalid configuration.

class ketl.db.models.Source(**kwargs)

Bases: sqlalchemy.orm.decl_api.Base

A class representing a source of some data. Can be subclassed on source type.

api_config
api_config_id
base_url
data_dir
property expected_files: List[ketl.db.models.ExpectedFile]

Return a list of expected files for the given source.

Returns

A list of ExpectedFile objects.

id
meta
source_files
property source_hash
source_type

ketl.db.settings module

ketl.db.settings.get_engine(conn_string='sqlite:///ketl.db')
ketl.db.settings.get_session() -> sqlalchemy.orm.session.Session

Module contents