Commit e33c1e6e authored Oct 27, 2020 by Roque
Update readme with installation steps
parent a49e39ff

Showing 1 changed file with 71 additions and 29 deletions

README.md
@@ -9,16 +9,58 @@ Along with Wendelin platform, ebulk is combined to form an easy to use Data Lake
- Many binary formats (ndarray, video, etc.)
- Trade secret

# PROJECT CONTENT:
- Bash script for ingestion and download
- Embulk plugins
- Configuration files (yml)
# REQUIREMENTS
This tool relies on **Embulk** Java application (see [docs](http://www.embulk.org/)).
Please make sure that [Java 8](http://www.oracle.com/technetwork/java/javase/downloads/index.html) is installed.
After the package is installed, the bash script will try to install Embulk automatically on first use (if it is not already installed).
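The Java requirement above can be checked before running the tool. The following is a minimal sketch, not part of ebulk: the helper name `java_major_version` is hypothetical, and it only parses the version line that `java -version` prints. The README asks for Java 8; whether newer versions also work is not stated here.

```shell
# Hypothetical helper (not part of ebulk): extract the major Java version
# from the first line of `java -version` output.
# Handles the legacy scheme ("1.8.0_292" -> 8) and the modern one ("11.0.2" -> 11).
java_major_version() {
    # $1: a line such as: openjdk version "1.8.0_292"
    jmv_version=$(printf '%s\n' "$1" | sed -n 's/.*version "\([^"]*\)".*/\1/p')
    case "$jmv_version" in
        "")  return 1 ;;                                        # no version string found
        1.*) printf '%s\n' "$jmv_version" | cut -d. -f2 ;;      # legacy: 1.<major>.x
        *)   printf '%s\n' "$jmv_version" | cut -d. -f1 | cut -d- -f1 ;;
    esac
}

# Usage (java prints its version on stderr, hence 2>&1):
#   major=$(java_major_version "$(java -version 2>&1 | head -n 1)")
#   [ "$major" = "8" ] || echo "Java 8 is required, found: ${major:-none}" >&2
```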
# INSTALL
Please use the package installation for your operating system and follow the installation instructions.
## Linux
The ebulk package, available in the ubuntu-ppa repository, can be installed easily with apt commands.
Make sure `software-properties-common` is installed in order to run all apt commands:
```
sudo apt-get install software-properties-common
```
Add the ppa repository:
```
sudo add-apt-repository ppa:rporchetto/ebulk-ppa
```
Update your local sources and install ebulk:
```
sudo apt-get update
sudo apt-get install ebulk
```
## Debian considerations
If the apt installation fails for your OS version/series, it is recommended to install ebulk from the `.deb` package directly.
Please download the latest ebulk `.deb` package and install it by running:
```
dpkg -i ebulk_package.deb
```
## Mac OS X
Installation on Mac OS can be done via homebrew packages by running:
```
brew install https://github.com/roquegit/homebrew-ebulk/raw/master/ebulk.rb
```
## Potential installation issues
During the package installation, or during the first ebulk execution, the bash script will try to install Embulk automatically (if it is not installed).
If your OS needs special permissions, it may be necessary to install Embulk v0.9.7 manually:
curl --create-dirs -o ~/.embulk/bin/embulk -L "https://dl.bintray.com/embulk/maven/embulk-0.9.7.jar"
...
...
@@ -30,9 +72,9 @@ Along with Wendelin platform, ebulk is combined to form an easy to use Data Lake
To start the download, run the following command:
```
ebulk pull <DATA_SET>
```
where `<DATA_SET>` is the dataset reference shown on the site.
(e.g. **ebulk pull my-dataset**)
...
...
@@ -47,9 +89,9 @@ Along with Wendelin platform, ebulk is combined to form an easy to use Data Lake
If you need to specify the chunk size for split download (e.g. due to memory errors with big files), run the command with these parameters:
```
ebulk pull <DATA_SET> -c <CHUNK_SIZE>
```
where `<CHUNK_SIZE>` is an integer that sets the size in Mb.
(e.g. **ebulk pull my-dataset -c 10**)
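Since `<CHUNK_SIZE>` must be an integer, a wrapper script may want to validate it before calling the tool. The sketch below is an illustration only: `is_valid_chunk_size` and `pull_command` are hypothetical helpers, not ebulk features, and the command is printed rather than executed so the sketch has no side effects.

```shell
# Hypothetical helper: succeed only if $1 is a positive integer
# (digits only, and greater than zero).
is_valid_chunk_size() {
    case "$1" in
        ""|*[!0-9]*) return 1 ;;   # empty, or contains a non-digit
    esac
    [ "$1" -gt 0 ]
}

# Hypothetical helper: print the pull command for dataset $1,
# adding -c only when a chunk size ($2) is given.
pull_command() {
    if [ -n "$2" ]; then
        if ! is_valid_chunk_size "$2"; then
            echo "invalid chunk size: $2" >&2
            return 1
        fi
        printf 'ebulk pull %s -c %s\n' "$1" "$2"
    else
        printf 'ebulk pull %s\n' "$1"
    fi
}

# Usage:
#   pull_command my-dataset 10    # prints the command with a 10 Mb chunk size
```

To actually run the download, the printed line can be reviewed and then executed by hand.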
...
...
@@ -57,9 +99,9 @@ Along with Wendelin platform, ebulk is combined to form an easy to use Data Lake
# CUSTOMIZE OUTPUT DIRECTORY
Use a custom output directory, different from the dataset reference. That location will be linked to the dataset reference.
```
ebulk pull <DATA_SET> -d <PATH>
```
where `<PATH>` is the output location of the downloaded files.
(e.g. **ebulk pull my-dataset -d some/different/path**)
...
...
@@ -73,9 +115,9 @@ Along with Wendelin platform, ebulk is combined to form an easy to use Data Lake
# INGESTION QUICK START
To start the ingestion, run the following command:
```
ebulk push <DATA_SET>
```
where `<DATA_SET>` is the dataset reference for your dataset and also the input directory where the files are.
(e.g. **ebulk push my-dataset**)
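Because `<DATA_SET>` names both the dataset reference and the input directory, a script can check that the directory exists before starting an ingestion. This is a minimal sketch under that reading of the text above; `push_command` is a hypothetical helper, not part of ebulk, and it prints the command instead of running it.

```shell
# Hypothetical helper: print the push command for dataset $1,
# but only if a directory with that name exists.
push_command() {
    if [ ! -d "$1" ]; then
        echo "input directory not found: $1" >&2
        return 1
    fi
    printf 'ebulk push %s\n' "$1"
}

# Usage:
#   push_command my-dataset    # fails unless ./my-dataset is a directory
```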
...
...
@@ -95,9 +137,9 @@ Along with Wendelin platform, ebulk is combined to form an easy to use Data Lake
- Amazon web service S3: s3
To use one of those storages as input, run the following command:
```
ebulk push <DATA_SET> --storage <STORAGE>
```
where `<STORAGE>` is one of the following available inputs: ftp, http, s3
(e.g. **ebulk push my-dataset --storage http**)
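A script that takes the storage name from user input could reject unsupported values early. The following sketch assumes the three names listed above (ftp, http, s3) are the only valid ones; `is_supported_storage` and `push_from_storage_command` are hypothetical helpers, and the command is only printed.

```shell
# Hypothetical helper: succeed only for the storage names listed in this README.
is_supported_storage() {
    case "$1" in
        ftp|http|s3) return 0 ;;
        *)           return 1 ;;
    esac
}

# Hypothetical helper: print the push command for dataset $1 and storage $2,
# or fail with a message when the storage name is not supported.
push_from_storage_command() {
    if ! is_supported_storage "$2"; then
        echo "unsupported storage: $2 (expected ftp, http or s3)" >&2
        return 1
    fi
    printf 'ebulk push %s --storage %s\n' "$1" "$2"
}

# Usage:
#   push_from_storage_command my-dataset http
```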
...
...
@@ -108,16 +150,16 @@ Along with Wendelin platform, ebulk is combined to form an easy to use Data Lake
The user can edit the Embulk configuration file of the selected storage to run more complex scenarios
* Please keep in mind that some knowledge about Embulk is required
```
ebulk push <DATA_SET> --storage <STORAGE> --advanced
```
# CUSTOM
The user can request the installation of a new input storage by running the following command:
```
ebulk push <DATA_SET> --custom-storage
```
The tool will ask the user for the desired Embulk input plugin (gem) in order to install it.
The input gem can be picked from here: http://www.embulk.org/plugins/
...
...