ketl.extractor package
Submodules
ketl.extractor.Extractor module
- class ketl.extractor.Extractor.DefaultExtractor(api_config: Union[ketl.db.models.API, int, str], skip_existing_files: bool = False, overwrite_on_extract=True, show_progress: bool = False, concurrency: str = 'sync', on_disk_check='full', expected_file_generation='incremental')
Bases: ketl.extractor.Extractor.BaseExtractor
The default extractor can fetch files from an FTP server or from any location that smart_open can open. The user must supply any credentials required to access those resources.
- BLOCK_SIZE = 16384
- extract() → List[pathlib.Path]
Run the extractor. It attempts to minimize repeated work by checking which cached files already exist, whether on disk or in the database, and by batching downloads. If the concurrency parameter is set to multiprocess, the work is distributed across processes.
- Returns
a list of paths corresponding to all the ExpectedFiles that the extractor’s API is responsible for.
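The existence check and batching described for extract() can be pictured with a small standard-library sketch. This is not ketl's implementation: split_by_presence and batched are hypothetical helper names introduced here for illustration, and only the on-disk half of the check is shown.

```python
from pathlib import Path
from typing import Iterable, Iterator, List, Tuple


def split_by_presence(paths: Iterable[Path]) -> Tuple[List[Path], List[Path]]:
    """Separate paths that already exist on disk from those that do not.

    Hypothetical helper: it only illustrates the idea of skipping files
    that are already present.
    """
    present: List[Path] = []
    missing: List[Path] = []
    for path in paths:
        (present if path.exists() else missing).append(path)
    return present, missing


def batched(items: List[Path], batch_size: int) -> Iterator[List[Path]]:
    """Yield consecutive slices of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]
```

Only the missing list would then need to be downloaded, one batch at a time.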
- get_file(cached_file_id: int, source_url: str, target_file: pathlib.Path, refresh_interval: datetime.timedelta, url_params=None, show_progress=False, force_download=False) → Optional[dict]
Download a file either using the FTP downloader or the generic downloader.
- Parameters
cached_file_id – the id of the cached file.
source_url – the url from which to get the file.
target_file – the path to which the file should be downloaded.
refresh_interval – the maximum age of the file.
url_params – optional query parameters.
show_progress – whether to show a tqdm progress bar.
force_download – whether to force download regardless of file presence.
- Returns
A dict that contains the updated data for the cached file.
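One way to read the refresh_interval and force_download parameters together is as a staleness test on the target file. The sketch below is an assumption about how such a decision could be made, not a description of what get_file actually does. needs_download is a hypothetical function, and using the file's modification time as its age is also an assumption.

```python
from datetime import datetime, timedelta
from pathlib import Path
from typing import Optional


def needs_download(target_file: Path,
                   refresh_interval: timedelta,
                   force_download: bool = False,
                   now: Optional[datetime] = None) -> bool:
    """Return True if the file should be fetched again.

    Hypothetical rule: download when forced, when the file is absent,
    or when its modification time is older than refresh_interval.
    """
    if force_download or not target_file.exists():
        return True
    current = now or datetime.now()
    modified = datetime.fromtimestamp(target_file.stat().st_mtime)
    return current - modified > refresh_interval
```

The optional now argument exists only to make the function easy to exercise with a fixed clock.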
ketl.extractor.Rest module
- class ketl.extractor.Rest.RestMixin
Bases: object
A mixin that provides methods for calling a REST API.
- get(base_url, resource, params=None, data_schema: Optional[marshmallow.schema.Schema] = None, result_schema: Optional[marshmallow.schema.Schema] = None, **kwargs)
Get a resource.
- Parameters
base_url – the base URL.
resource – the resource to get (requests base_url/resource).
params – optional URL parameters.
data_schema – an optional schema to validate submitted data.
result_schema – an optional schema to validate returned data.
kwargs – additional keyword args.
- Returns
a JSON result.
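The parameter descriptions suggest that the request target is formed by joining base_url and resource, with params appended as a query string. A minimal standard-library sketch of that composition follows. build_url is a hypothetical name, and the slash handling is an assumption rather than documented ketl behavior.

```python
from typing import Mapping, Optional
from urllib.parse import urlencode


def build_url(base_url: str,
              resource: str,
              params: Optional[Mapping[str, str]] = None) -> str:
    """Join base_url and resource, then append params as a query string."""
    # Normalize slashes so exactly one separates the two parts.
    url = f"{base_url.rstrip('/')}/{resource.lstrip('/')}"
    if params:
        url = f"{url}?{urlencode(params)}"
    return url
```

For example, build_url("https://example.org/api/", "/items", {"page": "2"}) gives "https://example.org/api/items?page=2".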
- post(base_url, resource, data=None, json=None, data_schema: Optional[marshmallow.schema.Schema] = None, result_schema: Optional[marshmallow.schema.Schema] = None, **kwargs)
Post a resource.
- Parameters
base_url – the base URL.
resource – the resource to post to (requests base_url/resource).
data – optional request body data.
json – optional JSON payload.
data_schema – an optional schema to validate submitted data.
result_schema – an optional schema to validate returned data.
kwargs – additional keyword args.
- Returns
a JSON result.
- put(base_url, resource, data=None, json=None, data_schema: Optional[marshmallow.schema.Schema] = None, result_schema: Optional[marshmallow.schema.Schema] = None, **kwargs)
Put a resource.
- Parameters
base_url – the base URL.
resource – the resource to put (requests base_url/resource).
data – optional request body data.
json – optional JSON payload.
data_schema – an optional schema to validate submitted data.
result_schema – an optional schema to validate returned data.
kwargs – additional keyword args.
- Returns
a JSON result.
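To make the json argument of post and put concrete, the sketch below prepares a request with a JSON body using only the standard library. It does not send anything and does not reflect which HTTP client ketl uses internally. build_json_request is a hypothetical helper, and the Content-Type header is an assumption about what a JSON endpoint typically expects.

```python
import json
from typing import Any
from urllib.request import Request


def build_json_request(url: str, payload: Any, method: str = "POST") -> Request:
    """Prepare (but do not send) a request whose body is JSON-encoded."""
    body = json.dumps(payload).encode("utf-8")
    return Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method=method,
    )
```

Passing method="PUT" yields the equivalent request for an update, which mirrors how post and put share the same parameter list.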