Converting only parts of a Subversion repository to git

I have an old Subversion repository with lots of my private projects. Parts of it were converted from an old CVS repository some years ago (with cvs2svn or similar). Its current structure looks like this:

trunk
- latex
- java
  - awt-doku
  - pps
    - build.xml
    - src
      - ant
      - de
        - dclj
          - faq
          - paul
            - (about 20 other packages)
            - ltxdoclet
              - (some java files)
- lua
- (other directories)
branches
tags
import

A problem is that I reorganized this repository quite a bit: for example, the contents of the pps directory were once in a subdirectory of import (I think I imported them there from CVS), and there may have been other moves, too.

I'm now interested in the contents of the ltxdoclet directory together with some other files along the path, like build.xml, the ant directory and so on. I want their whole history, including any history from before the files were moved, and I want it as a git repository (since I want to publish this on GitHub). The tags and branches were never really used, so they are not important.

I do not want the rest of this repository (the other projects will get separate git repositories sometime): it would blow up my repository too much, and there is some stuff I don't want to publish.

Ideally, my resulting git repository (in the HEAD state) should look like this:

pps
- build.xml
- src
  - ant
  - de
    - dclj
      - paul
        - ltxdoclet
          - (some java files)

I don't really care about historical directory configurations, but the history should not contain any commits that did not touch any of the files in these directories (or their predecessors).

Of course, git svn seems to be the tool of choice. (Are there others?)

git svn clone seems to be the right command ... but with which options? I created an authors.txt to convert the CVS or SVN user names to my name and address. To have only the interesting files and directories, I use --ignore-paths.

This was my try:

filter='^/xcb-src/|_00|src/resources|dclj/faq|dclj/paul/([^l]|l[^t])'
git svn clone svn+ssh://mathe-svn/ --trunk trunk/java/pps -A authors.txt --ignore-paths="$filter" latexdoclet
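Before starting a long conversion, a regex like this can be sanity-checked against a few sample paths. The following is only a rough sketch: the file names are hypothetical, and grep -E is used as a stand-in for the Perl regex matching that git svn does, which behaves the same for a pattern this simple but may see the paths in a slightly different form.

```shell
#!/bin/sh
# The filter from the attempt above.
filter='^/xcb-src/|_00|src/resources|dclj/faq|dclj/paul/([^l]|l[^t])'

# Hypothetical sample paths; anything matching the filter would be ignored,
# so grep -v prints what would be kept.
kept=$(printf '%s\n' \
  'build.xml' \
  'src/ant/Task.java' \
  'src/resources/logo.png' \
  'src/de/dclj/faq/Faq.java' \
  'src/de/dclj/paul/latex/Tex.java' \
  'src/de/dclj/paul/ltxdoclet/Main.java' \
  | grep -E -v "$filter")

printf '%s\n' "$kept"
```

Note that the last alternative also drops any file sitting directly in dclj/paul whose name does not start with "lt".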

Of course, it shows only the history after commit 2306, when I moved import/java-pps to trunk/java/pps ... and it has lots of commits which have no changes at all.

To solve the first problem, I thought about giving also the old directory as --trunk:

git svn clone svn+ssh://mathe-svn/ --trunk trunk/java/pps --trunk import/java-pps -A authors.txt --ignore-paths="$filter" latexdoclet

This does not work: the first --trunk is ignored, and the history effectively ends at commit 2305 (before the move). It also contains lots of empty commits.

My current try is to import the whole repository, filtering out anything not wanted:

filter='/xcb-src/|_00|src/resources|dclj/faq|dclj/paul/([^l]|l[^t])|/esperanto|finanzen|diverses|homepage|konfig|lua|prog-aufgaben|CVSROOT|latex|tags/'
git svn clone svn+ssh://mathe-svn/ -A authors.txt --ignore-paths="$filter" latexdoclet-neu

The conversion is still running, but there certainly are lots of commits I don't want at all.

Edit: conversion completed - I now have 2658 commits (3176 objects in git), and only about 36 of them have some interesting tree change, if I configured my gitk filter right. (+ about 3 more which were erroneously filtered out, since our latex source file was first in the latex directory.)
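As a cross-check for the gitk filter, the number of commits touching a given path can also be counted on the command line with a path-limited git rev-list. This is a minimal sketch on a throwaway repository; the directory layout and file names are made up for illustration.

```shell
#!/bin/sh
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email "demo@example.com"
git config user.name "Demo"

# Hypothetical layout: one wanted directory, one unwanted directory.
mkdir -p trunk/java/pps/src trunk/lua
echo 'class Main {}' > trunk/java/pps/src/Main.java
echo 'print(1)' > trunk/lua/a.lua
git add . && git commit -q -m "initial import"
echo 'print(2)' > trunk/lua/a.lua
git commit -q -am "lua only"
echo '// comment' >> trunk/java/pps/src/Main.java
git commit -q -am "doclet change"

total=$(git rev-list --count HEAD)                         # all commits
interesting=$(git rev-list --count HEAD -- trunk/java/pps) # commits touching the path
echo "$interesting of $total commits touch trunk/java/pps"
```

Several paths can be given after the `--`, for example both the old and the new location of a moved directory.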

Does anyone have better ideas on how to do this?
Would it be better to import the whole repository first and then use git filter-branch to pick out the files and commits I want?

Here is what I did, for reference.

After the answer from Dustin I first converted the whole svn repository to git, with

 git svn clone -A authors.txt svn+ssh://mathe-svn/ all-projects

This got me a quite huge git repository of 24241 objects and 24 MB (after packing), from an svn repository of 45 MB. As already said in a comment, both had 2658 commits in a linear history, so nothing was lost up to this point.

Then I started to filter things out. Of the filters offered by git filter-branch, --index-filter seemed the most useful, since it does not need to check out anything (unlike --tree-filter), and I did not want to rewrite the metadata, only remove unwanted files.

Additionally, --prune-empty is useful for dropping the commits that end up empty. I also used -d /dev/shm/ebermann/git-work/tmp to put the working directory in a tmpfs, but I don't know if this really mattered, since I did no checkouts here. I used the --original option to preserve the original master reference under a new name. (Why doesn't filter-branch allow simply creating a new branch and leaving the old one intact?)

As my index filter, I used git rm --cached -r --ignore-unmatch, to which I fed a list of files and directories via xargs.

So, I had multiple calls of

git filter-branch \
  -d /dev/shm/ebermann/git-work/tmp \
  --index-filter "xargs -a ~/projektoj/git-conversion/remove-liste-5.txt git rm --cached -r --ignore-unmatch" \
  --original "step8" \
  master

and

git filter-branch \
  -d  /dev/shm/ebermann/git-work/tmp  \
  --prune-empty \
  --original "step9" \
  master
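To try this kind of filtering safely, the two steps can be rehearsed on a throwaway repository first. The sketch below makes a few assumptions: the repository contents and the removal list are invented, both steps are combined into one filter-branch call, the default backup namespace refs/original is used instead of --original, and the list is fed to xargs by input redirection rather than the GNU-only -a option.

```shell
#!/bin/sh
set -e
export FILTER_BRANCH_SQUELCH_WARNING=1  # skip the warning pause in newer Git versions

work=$(mktemp -d)
cd "$work"
git init -q demo
cd demo
git symbolic-ref HEAD refs/heads/master  # make sure the branch is called master
git config user.email "demo@example.com"
git config user.name "Demo"

# Three commits: one mixed, one touching only unwanted files, one touching wanted files.
mkdir keep drop
echo 'class A {}' > keep/A.java
echo 'notes' > drop/notes.txt
git add . && git commit -q -m "import everything"
echo 'more notes' >> drop/notes.txt
git commit -q -am "unrelated change"
echo '// comment' >> keep/A.java
git commit -q -am "doclet change"

# Hypothetical removal list, one path per line, stored outside the repository.
list="$work/remove-list.txt"
echo 'drop' > "$list"

# Remove the listed paths from every commit and drop commits that become empty.
git filter-branch --prune-empty \
  --index-filter "xargs git rm -q --cached -r --ignore-unmatch < '$list'" \
  -- master > /dev/null

commits_before=$(git rev-list --count refs/original/refs/heads/master)
commits_after=$(git rev-list --count master)
files_after=$(git ls-tree -r --name-only master)
echo "commits: $commits_before -> $commits_after; remaining files: $files_after"
```

The commit that only touched the drop directory disappears, while the old history stays reachable through the backup ref.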

In between, I took a look at the created branch with gitk, looking for files I had forgotten. I created the first file list from the output of svn ls svn+ssh://mathe-svn/path, removing the files/directories I wanted to retain. I later had to repeat this for older revisions, since some files had been renamed (or more exactly, whole directory trees had been moved), so the old names did not show up. Also, some files had been removed before the current revision.

Now I have my master branch reduced to 40 revisions, and my HEAD contains 39 files and directories.

The repository (only this branch cloned into a new repository) is now only 180 KB in size (with a working tree of 288 KB). I'll now go and clean up the commit comments (which often have nothing at all to do with this project), and then publish it on GitHub.

For the next time, is there some command which creates a list of all file paths which have ever existed in my repository (without checking all revisions out and for each invoking find or such)? (Either for git or svn would be okay.)
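One possible approach on the git side, sketched here on a throwaway repository with made-up paths, is to let git log print the names of all changed files and deduplicate them. The --no-renames option matters: without it, a moved file is reported only under its new name. Merge commits are not covered by default, which is fine for a linear history like this one.

```shell
#!/bin/sh
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email "demo@example.com"
git config user.name "Demo"

# Hypothetical history: import, move, add a file, delete it again.
mkdir -p import/java-pps
echo 'class A {}' > import/java-pps/A.java
git add . && git commit -q -m "import from CVS"
mkdir -p trunk/java
git mv import/java-pps trunk/java/pps
git commit -q -m "move to trunk"
echo '<project/>' > trunk/java/pps/build.xml
git add . && git commit -q -m "add build file"
git rm -q trunk/java/pps/build.xml
git commit -q -m "remove build file again"

# Every path added, changed or deleted in any commit reachable from any ref.
all_paths=$(git log --all --no-renames --name-only --pretty=format: \
  | sed '/^$/d' | sort -u)
printf '%s\n' "$all_paths"
```

On the Subversion side, svn log -v lists the changed paths of each revision, so its output could be post-processed in a similar way.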

Yes, learn filter-branch and do all the edits after the conversion. You can do it incrementally and reverse each step if you get it wrong.
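Reversing a step could look roughly like the following sketch. It assumes the default backup namespace (refs/original) rather than a custom --original name, and uses an invented two-file repository in which build.xml is removed by mistake.

```shell
#!/bin/sh
set -e
export FILTER_BRANCH_SQUELCH_WARNING=1  # skip the warning pause in newer Git versions

work=$(mktemp -d)
cd "$work"
git init -q demo
cd demo
git symbolic-ref HEAD refs/heads/master  # make sure the branch is called master
git config user.email "demo@example.com"
git config user.name "Demo"

echo 'class A {}' > A.java
echo '<project/>' > build.xml
git add . && git commit -q -m "import"
before=$(git rev-parse master)

# A filtering step that turns out to be wrong: it removes build.xml.
git filter-branch \
  --index-filter 'git rm -q --cached --ignore-unmatch build.xml' \
  -- master > /dev/null
after=$(git rev-parse master)

# Undo: point master back at the backup that filter-branch kept.
git reset -q --hard refs/original/refs/heads/master
restored=$(git rev-parse master)

# Remove the backup ref so the next filter-branch run can create a fresh one.
git update-ref -d refs/original/refs/heads/master
echo "before=$before after=$after restored=$restored"
```

After the reset, master is back at the original commit and build.xml is in the working tree again.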