ketl.extractor package

Submodules

ketl.extractor.Extractor module

class ketl.extractor.Extractor.BaseExtractor

Bases: object

abstract extract() → List[pathlib.Path]

class ketl.extractor.Extractor.DefaultExtractor(api_config: Union[ketl.db.models.API, int, str], skip_existing_files: bool = False, overwrite_on_extract=True, show_progress: bool = False, concurrency: str = 'sync', on_disk_check='full', expected_file_generation='incremental')

Bases: ketl.extractor.Extractor.BaseExtractor

The default extractor can fetch files from an FTP server or any location that is openable via smart_open. It is up to the user to provide any credentials that are required to access the desired resources.

BLOCK_SIZE = 16384
extract() → List[pathlib.Path]

Run the extractor. Attempts to minimize the amount of repeated work by checking which cached files actually exist, whether on disk or in the database, and batching downloads. Optionally distributes the work across processes if the concurrency parameter is set to multiprocess.

Returns

A list of paths corresponding to all the ExpectedFiles that the extractor’s API is responsible for.
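The caching behaviour described for extract() can be illustrated with a small, self-contained sketch. It is not ketl's implementation: split_existing and batched are hypothetical helpers that only show the general pattern of skipping files already on disk and grouping the remainder into batches.

```python
from pathlib import Path
from typing import Iterable, Iterator, List, Tuple


def split_existing(paths: Iterable[Path]) -> Tuple[List[Path], List[Path]]:
    """Partition paths into (already on disk, still missing)."""
    present: List[Path] = []
    missing: List[Path] = []
    for path in paths:
        (present if path.exists() else missing).append(path)
    return present, missing


def batched(items: List[Path], size: int) -> Iterator[List[Path]]:
    """Yield consecutive slices of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]
```

Only the missing list would need to be downloaded, and each batch could be handed to a separate worker when a multiprocess mode is in use.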

get_file(cached_file_id: int, source_url: str, target_file: pathlib.Path, refresh_interval: datetime.timedelta, url_params=None, show_progress=False, force_download=False) → Optional[dict]

Download a file either using the FTP downloader or the generic downloader.

Parameters
  • cached_file_id – the id of the cached file.

  • source_url – the url from which to get the file.

  • target_file – the path to which the file should be downloaded.

  • refresh_interval – the maximum age of the file.

  • url_params – optional query parameters.

  • show_progress – whether to show a tqdm progress bar.

  • force_download – whether to force download regardless of file presence.

Returns

A dict that contains the updated data for the cached file.
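How refresh_interval and force_download might interact can be sketched as follows. This is an illustration under stated assumptions, not ketl's code: needs_download is a hypothetical helper, and it measures age from the file's modification time on disk, whereas ketl may track this in its database instead.

```python
from datetime import datetime, timedelta
from pathlib import Path
from typing import Optional


def needs_download(target_file: Path,
                   refresh_interval: timedelta,
                   force_download: bool = False,
                   now: Optional[datetime] = None) -> bool:
    """Return True if the file is absent, too old, or a download is forced."""
    if force_download or not target_file.exists():
        return True
    now = now or datetime.now()
    # Assumption: the file's age is taken from its modification time.
    modified = datetime.fromtimestamp(target_file.stat().st_mtime)
    return now - modified > refresh_interval
```

The now argument exists only to make the check easy to test with a fixed clock.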

ketl.extractor.Rest module

class ketl.extractor.Rest.RestMixin

Bases: object

A mixin that contains calls that use a REST API.

get(base_url, resource, params=None, data_schema: Optional[marshmallow.schema.Schema] = None, result_schema: Optional[marshmallow.schema.Schema] = None, **kwargs)

Get a resource.

Parameters
  • base_url – the URL.

  • resource – the resource to get (gets URL/resource).

  • params – optional URL parameters.

  • data_schema – an optional schema to validate submitted data.

  • result_schema – an optional schema to validate returned data.

  • kwargs – additional keyword args.

Returns

a JSON result.
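The way base_url, resource and params combine into a request URL can be sketched with the standard library alone. build_url is a hypothetical helper; the exact joining and encoding rules used by RestMixin are an assumption here.

```python
from typing import Mapping, Optional
from urllib.parse import urlencode


def build_url(base_url: str, resource: str,
              params: Optional[Mapping[str, str]] = None) -> str:
    """Join base_url and resource with a single slash, then append params."""
    url = f"{base_url.rstrip('/')}/{resource.lstrip('/')}"
    if params:
        # urlencode percent-encodes keys and values for the query string.
        url = f"{url}?{urlencode(params)}"
    return url
```

For instance, build_url("https://example.com/api/", "/items", {"page": "2"}) yields "https://example.com/api/items?page=2".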

post(base_url, resource, data=None, json=None, data_schema: Optional[marshmallow.schema.Schema] = None, result_schema: Optional[marshmallow.schema.Schema] = None, **kwargs)

Post a resource.

Parameters
  • base_url – the URL.

  • resource – the resource to post to (posts to URL/resource).

  • data – optional data to send in the request body.

  • json – optional JSON-serializable object to send in the request body.

  • data_schema – an optional schema to validate submitted data.

  • result_schema – an optional schema to validate returned data.

  • kwargs – additional keyword args.

Returns

a JSON result.
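The roles of data_schema and result_schema can be sketched as a validate, send, validate pipeline. This is an assumption about the order of operations, not a description of RestMixin's internals. Loader, post_validated and send are hypothetical names; any object with a load method, such as a marshmallow Schema instance, would fit the Loader protocol.

```python
from typing import Any, Callable, Optional, Protocol


class Loader(Protocol):
    """Anything with a load() method that validates and returns data."""

    def load(self, data: Any) -> Any: ...


def post_validated(send: Callable[[Any], Any], payload: Any,
                   data_schema: Optional[Loader] = None,
                   result_schema: Optional[Loader] = None) -> Any:
    """Validate the payload, send it, then validate the response."""
    if data_schema is not None:
        payload = data_schema.load(payload)  # raises if the payload is invalid
    result = send(payload)  # stands in for the actual HTTP call
    if result_schema is not None:
        result = result_schema.load(result)  # raises if the response is invalid
    return result
```

Injecting send keeps the sketch free of network access; in practice it would wrap whatever HTTP client performs the request.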

put(base_url, resource, data=None, json=None, data_schema: Optional[marshmallow.schema.Schema] = None, result_schema: Optional[marshmallow.schema.Schema] = None, **kwargs)

Put a resource.

Parameters
  • base_url – the URL.

  • resource – the resource to put (puts to URL/resource).

  • data – optional data to send in the request body.

  • json – optional JSON-serializable object to send in the request body.

  • data_schema – an optional schema to validate submitted data.

  • result_schema – an optional schema to validate returned data.

  • kwargs – additional keyword args.

Returns

a JSON result.

Module contents