$ git init
$ dvc init
$ dvc add < path to data>
$ cat data/.gitignore # to show thata the data is not being tracked
$ git add data/datafile.dvc data/.gitignore # we send the datafile.dvc to git instead of the data. This is how both data stay connected
- Examine data/.gitignore to show that the data is not being sent to git
- set up remote storage for data versioning
$ git status # to show files in the staging area
$ git commit both data/datafile.dvc data/.gitignore
$ cat data/datafile.dvc # to see the cryptic name created to rep the file
$ dvc remote add -d dvcOrigin < to storage location>
$ cat .dvc/config # shows the remote location in our config file
$ git commit -m " Configure the remote storage for dvc" # commit the configuration file
$ dvc push # push file to remote storage
$ rm -f raw_data/fake_file.csv # remove the raw dataset file
$ rm -rf .dvc/cache/ # remove the dvc cache
$ ls raw_data # show that the data is deleted
$ dvc pull # to bring back the raw dataset
$ cp raw_data/fake_file.csv /tmp/fake_file.csv # copy the dataset to another directory
$ cp raw_data/fake_file.csv ./tmp/fake_file.csv # confirm copied data
$ ls -lh raw_data/fake_file.csv
$ ls -lh tmp/fake_file.csv # Check the saize of the dataset
$ cat tmp/fake_file.csv >> raw_data/fake_file.csv # concatenate the dataset
$ $ ls -lh raw_data/fake_file.csv # check the size again to make sure this is done
$ dvc add raw_data/fake_file.csv # add the new file
$ git add raw_data/fake_file.csv.dvc # track with git
$ git status
$ git commit -m "A new file was receievd from data owner"
$ dvc push # pushes the new file to our remote storage
$ git checkout HEAD~1 raw_data/fake_file.csv.dvc # checkout theoriginal file
$ dvc checkout # we do not need to do "dvc add" after this one cos we already have a copy of this version in storage
Now we have two versions of the same data set in our remote storage. We can go back and forth between these datasets
$ git checkout HEAD~1 raw_data/fake_file.csv.dvc # check out the .dvc here
$ dvc checkout
$ ls -lh raw_data/fake_file.csv # check the size again
$ git commit -m "Revert back to the raw dataset" # commit the new .dvc file
$ git checkout -b hadev # check out your development branch
$ git remote add origin <remote url> # add remote
$ git branch -M main # sets the main path from the main branch
$ git push -u origin main #
- Examine the remote branch
You can download data from a repo into your local directory. As long as the data in that repo is dvc tracked. You
$ dvc get <remote url < to the data.dvc> # Option 1
$ dvc add newTxt.txt # add the file to dvc to track the data
$ git add newTxt.txt.dvc .gitignore # start tracking the file
$ dvc push # add to remote
- Let us make some changes to this file we just downloaded.
$ dvc add newTxt.txt # add the file to dvc to track the data
$ git add newTxt.txt.dvc .gitignore # start tracking the file
$ dvc push # add to remote
- Both the source repo and target repo now have access to this file.
$ cat newTxt.txt # display the new text in the target directory
$ cat newData/newTxt.txt # display the old file in the source directroy.
$ git pull # update the source directory to reflect the changes
$ # back to the target directory
$ git checkout <SHA1> before the file is modified
$ dvc fetch -aT # -aT flags to get the DVC-tracked data from all Git branches and tags from remote storage to the cache
$ dvc checkout # gets the old data
$ git commit -m # commits changes
$ dvc push # to push to the remote storage
$ gp origin main # update remote
At the source directory ................
$ git checkout <SHA1> of the version that you want </file.dvc> # put you at that version
$ dvc checkout # checkout the version
$ cat <filename> to see the new version
- Though the data is version controlled, this first option does not allow us to track where the file came from.
- This option will help track where we download the data from
- we can actually update the exact same file
$ # Update the repos < git push or git push>
$ dvc import https://github.com/aadehamid/dvc-examples.git \ ✔ 7183 06:30:36
> getDVCImport/dvc_import.txt -o dvc_import.txt
$ # if sometimes has passed and you have not pushed to the repo and the file was modified by the owner and version controlled. you can use the below command to update the imported file.
$ dvc update </file.dvc>
$ git add dvc_import.txt.dvc
$ git commit -m <msg></msg>
$ dvc push
$ cat dvc_import.txt.dvc # Metadat information on where we got the data from
$ dvc add dvc_import.txt # adds after the file is updated
$ git add dvc_import.txt.dvc # set up for git versioning
$ git commit -m "first time: Modified the raw file for dvcImport demo"
$ dvc push
- At the source we first update git with {git pull and/or push} and dvc with {dvc fetch -aT}
$ # checkout the version that you want
$ git checkout <SHA1 of the version></SHA1>
$ dvc checkout # will give you this version