ketl.db package
Submodules
ketl.db.models module
- class ketl.db.models.API(**kwargs)
Bases: sqlalchemy.orm.decl_api.Base, ketl.extractor.Rest.RestMixin
The API class is the center of the organizational model for kETL. It configures the basic logic of accessing some set of resources, setting up credentials as needed.
- BATCH_SIZE = 10000
- property api_hash: str
Hash the API by hashing all of its sources and return the hex digest.
- Returns
Hex digest of the hash.
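One way to hash an API from its sources is to fold the per-source digests into a single hash. The sketch below shows that idea only; the `combine_source_hashes` helper, the choice of SHA-1, and the sorting step are assumptions for illustration and are not taken from kETL's implementation.

```python
import hashlib
from typing import Iterable


def combine_source_hashes(source_hashes: Iterable[str]) -> str:
    """Fold per-source hex digests into one hex digest (hypothetical helper)."""
    # Assumption: SHA-1 is used here only as an example algorithm.
    api_hash = hashlib.sha1()
    # Assumption: sorting makes the result independent of source order.
    for digest in sorted(source_hashes):
        api_hash.update(digest.encode("utf-8"))
    return api_hash.hexdigest()
```

With this approach, changing any single source digest changes the combined digest, which is what makes such a hash usable as a cheap "has the configuration changed" check.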
- property cached_files: sqlalchemy.orm.query.Query
Retrieve a list of all CachedFile objects configured for this API and stored in the database.
- Returns
a batched query object.
- cached_files_on_disk(use_hash=True, missing=False, limit_ids=None) → sqlalchemy.orm.query.Query
Retrieve a list of all cached files thought to (or known to) be on disk.
- Parameters
use_hash – if true, treat the presence of a hash in the database as evidence that the file exists; if false, check whether the file is actually present at its path.
missing – return any files that are configured but missing from disk.
limit_ids – limit the result set to the specific ids supplied.
- Returns
a batched query
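The interaction of the three parameters can be sketched without a database. In the example below, `FileRecord` and `files_on_disk` are hypothetical stand-ins for a CachedFile row and the query method, and a plain list replaces the batched query; only the parameter semantics described above are modeled.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Iterable, List, Optional


@dataclass
class FileRecord:
    """Hypothetical stand-in for a CachedFile row."""
    id: int
    path: str
    hash: Optional[str] = None


def files_on_disk(records: Iterable[FileRecord],
                  use_hash: bool = True,
                  missing: bool = False,
                  limit_ids: Optional[Iterable[int]] = None) -> List[FileRecord]:
    """Filter records by presence on disk (illustrative, in-memory only)."""
    wanted = set(limit_ids) if limit_ids is not None else None
    result = []
    for record in records:
        # Restrict to the requested ids, if any were supplied.
        if wanted is not None and record.id not in wanted:
            continue
        if use_hash:
            # Cheap check: a stored hash is taken as evidence of existence.
            present = record.hash is not None
        else:
            # Expensive check: actually look at the filesystem.
            present = Path(record.path).exists()
        # missing=False keeps present files; missing=True keeps absent ones.
        if present != missing:
            result.append(record)
    return result
```

The hash-based check avoids touching the filesystem, at the cost of trusting that the database is in sync with the disk.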
- creds
- description
- property expected_files: sqlalchemy.orm.query.Query
Retrieve all the expected files under this API.
- Returns
A list of expected files.
- static get_instance(model: Type[ketl.db.models.API], name=None) → ketl.db.models.API
Retrieve an instance of the given subclass of API. There can only be one instance per name.
- Parameters
model – A subclass of API.
name – An optional name. Only one API per name is allowed.
- Returns
An instance of the provided subclass of API.
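The "one instance per name" rule is a get-or-create pattern. The sketch below illustrates it with an in-memory dictionary standing in for the database table; `BaseAPI`, `WeatherAPI`, `_registry`, and the fallback to the class name when no name is given are all assumptions for illustration, not kETL's actual behavior.

```python
from typing import Dict, Optional, Type


class BaseAPI:
    """Hypothetical stand-in for ketl.db.models.API."""

    def __init__(self, name: str) -> None:
        self.name = name


class WeatherAPI(BaseAPI):
    """Hypothetical subclass used only for this illustration."""


# Assumption: a dict replaces the database lookup a real implementation would do.
_registry: Dict[str, BaseAPI] = {}


def get_instance(model: Type[BaseAPI], name: Optional[str] = None) -> BaseAPI:
    """Return the existing instance for this name, creating it on first use."""
    # Assumption: default to the class name when no explicit name is supplied.
    key = name or model.__name__
    if key not in _registry:
        _registry[key] = model(name=key)
    return _registry[key]
```

Calling `get_instance(WeatherAPI)` twice returns the same object, while passing a different `name` yields a separate instance.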
- hash
- id
- name
- abstract setup()
All subclasses of API must implement the setup method to generate the actual configuration that will specify what is to be downloaded.
- Returns
- sources
- class ketl.db.models.CachedFile(**kwargs)
Bases: sqlalchemy.orm.decl_api.Base
The CachedFile class represents a single file that may be downloaded by an extractor.
- BLOCK_SIZE = 65536
- cache_type
- expected_files
- expected_mode
- extract_to
- property file_hash
Return the hash of the file.
- Returns
The hash object (not the digest or the hex digest!) of the file.
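Returning the hash object rather than a digest lets the caller choose between `digest()` and `hexdigest()`. A minimal sketch of block-wise file hashing follows; the `hash_file` helper and the use of SHA-1 are assumptions for illustration, and only the block size matches the BLOCK_SIZE value documented above.

```python
import hashlib
from pathlib import Path

BLOCK_SIZE = 65536  # same value as the documented BLOCK_SIZE attribute


def hash_file(path: Path, block_size: int = BLOCK_SIZE):
    """Return a hash object for the file contents (hypothetical helper)."""
    # Assumption: SHA-1 is an example; the real algorithm is not stated here.
    file_hash = hashlib.sha1()
    with open(path, "rb") as handle:
        # Read fixed-size blocks so large files are never loaded whole.
        for block in iter(lambda: handle.read(block_size), b""):
            file_hash.update(block)
    return file_hash
```

A caller that wants a string for storage would then use `hash_file(path).hexdigest()`.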
- property full_path: pathlib.Path
Return the absolute path of the cached file.
- Returns
The absolute path of the file.
- property full_url: str
- hash
- id
- is_archive
- last_download
- last_update
- meta
- path
- preprocess(overwrite_on_extract=True) → Optional[dict]
Preprocess the file, extracting and creating expected files as needed.
- Returns
Optionally returns an expected file, if one was created directly from the cached file. Otherwise returns None.
- refresh_interval
- size
- source
- source_id
- url
- url_params
- class ketl.db.models.Creds(**kwargs)
Bases: sqlalchemy.orm.decl_api.Base
A simple class for keeping track of credentials. Details are stored in a JSON blob.
SECURITY WARNING: creds are currently stored unencrypted. Don’t put anything in here that requires real security.
- api_config
- api_config_id
- creds_details
- id
- class ketl.db.models.ExpectedFile(**kwargs)
Bases: sqlalchemy.orm.decl_api.Base
A class representing expected files to actually be processed.
- BLOCK_SIZE = 65536
- archive_path
- cached_file
- cached_file_id
- property file_hash
Hash the expected file.
- Returns
The hash object.
- file_type
- hash
- id
- last_processed
- meta
- path
- processed
- size
- class ketl.db.models.ExpectedMode(value)
Bases: enum.Enum
An enum representing the various ways of generating ExpectedFiles from CachedFiles.
- auto = 'auto'
- explicit = 'explicit'
- self = 'self'
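Because each member's value equals its name, a mode read from configuration as a string can be converted with the standard Enum call syntax. The sketch below redefines the enum locally so that it runs on its own; the `parse_mode` helper is hypothetical and not part of kETL.

```python
from enum import Enum


class ExpectedMode(Enum):
    """Local copy of the documented members, for illustration only."""
    auto = "auto"
    explicit = "explicit"
    self = "self"


def parse_mode(raw: str) -> ExpectedMode:
    """Hypothetical helper: turn a configuration string into an ExpectedMode."""
    try:
        # Enum lookup by value; raises ValueError for unknown strings.
        return ExpectedMode(raw)
    except ValueError:
        valid = ", ".join(mode.value for mode in ExpectedMode)
        raise ValueError(
            f"unknown expected mode {raw!r}; valid values: {valid}"
        ) from None
```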
- exception ketl.db.models.InvalidConfigurationError
Bases: Exception
Exception indicating an invalid configuration.
- class ketl.db.models.Source(**kwargs)
Bases: sqlalchemy.orm.decl_api.Base
A class representing a source of some data. Can be subclassed on source type.
- api_config
- api_config_id
- base_url
- data_dir
- property expected_files: List[ketl.db.models.ExpectedFile]
Return a list of expected files for the given source.
- Returns
a list of ExpectedFiles.
- id
- meta
- source_files
- property source_hash
- source_type