infopath Posted March 9, 2007
Hi all! Two weeks ago we started having replication problems. We have more than 1050 domain controllers, each in a separate branch office (one site per DC). There is a HUB site with four DCs, each of which is a bridgehead. Each branch has a Cisco 1711 router that makes a VPN connection to the HUB office. WAN speed: 1 Mbit/s download, 128 Kbit/s upload (ADSL). On some servers, when we try to replicate the domain naming context (dc=ourdomain,dc=net), we get "DS_REPLICA SYNC FAILED: The remote procedure call was cancelled". We've installed http://support.microsoft.com/kb/898060/en , but still no luck. No firewalls are turned on. Operating system: Windows Server 2003 Std. with SP1. Can anyone help?
jeff.sadowski Posted March 9, 2007
What kind of diagnostics have you run so far?
annakin108 Posted March 9, 2007
I can't even think of a good reason for that many DCs... Back to the drawing board... Is there some sort of legal reason? Is it politics? How big is your environment? How many users and such? MS has a new tool for monitoring this stuff, so have you spoken with support?
infopath Posted March 9, 2007 Author
I won't comment on the scale of the project. Microsoft's local branch was involved, so that's not "our" idea. This is a very large project: about 500 000 users will use the system. Here is a DCDIAG log:
====================================================
   ......................... OUR-BAD-DC passed test Replications
Starting test: NCSecDesc
   ......................... OUR-BAD-DC passed test NCSecDesc
Starting test: NetLogons
   ......................... OUR-BAD-DC passed test NetLogons
Starting test: Advertising
   ......................... OUR-BAD-DC passed test Advertising
Starting test: KnowsOfRoleHolders
   ......................... OUR-BAD-DC passed test KnowsOfRoleHolders
Starting test: RidManager
   ......................... OUR-BAD-DC passed test RidManager
Starting test: MachineAccount
   ......................... OUR-BAD-DC passed test MachineAccount
Starting test: Services
   NtFrs Service is stopped on [OUR-BAD-DC]
   ......................... OUR-BAD-DC failed test Services
Starting test: ObjectsReplicated
   ......................... OUR-BAD-DC passed test ObjectsReplicated
Starting test: frssysvol
   ......................... OUR-BAD-DC passed test frssysvol
Starting test: frsevent
   ......................... OUR-BAD-DC passed test frsevent
Starting test: kccevent
   ......................... OUR-BAD-DC passed test kccevent
Starting test: systemlog
   An Error Event occured. EventID: 0x00000457
      Time Generated: 03/09/2007 21:51:04
      (Event String could not be retrieved)
   An Error Event occured. EventID: 0x00000457
      Time Generated: 03/09/2007 21:51:04
      (Event String could not be retrieved)
   ......................... OUR-BAD-DC failed test systemlog
Starting test: VerifyReferences
   ......................... OUR-BAD-DC passed test VerifyReferences
Running partition tests on : ForestDnsZones
Starting test: CrossRefValidation
   ......................... ForestDnsZones passed test CrossRefValidation
Starting test: CheckSDRefDom
   ......................... ForestDnsZones passed test CheckSDRefDom
Running partition tests on : DomainDnsZones
Starting test: CrossRefValidation
   ......................... DomainDnsZones passed test CrossRefValidation
Starting test: CheckSDRefDom
   ......................... DomainDnsZones passed test CheckSDRefDom
Running partition tests on : Schema
Starting test: CrossRefValidation
   ......................... Schema passed test CrossRefValidation
Starting test: CheckSDRefDom
   ......................... Schema passed test CheckSDRefDom
Running partition tests on : Configuration
Starting test: CrossRefValidation
   ......................... Configuration passed test CrossRefValidation
Starting test: CheckSDRefDom
   ......................... Configuration passed test CheckSDRefDom
Running partition tests on : OUR-DOMAIN
Starting test: CrossRefValidation
   ......................... OUR-DOMAIN passed test CrossRefValidation
Starting test: CheckSDRefDom
   ......................... OUR-DOMAIN passed test CheckSDRefDom
Running enterprise tests on : OUR-DOMAIN.NET
Starting test: Intersite
   ......................... OUR-DOMAIN.NET passed test Intersite
Starting test: FsmoCheck
   ......................... OUR-DOMAIN.NET passed test FsmoCheck
===================================================
In addition, we do not use the File Replication Service for SYSVOL replication. Robocopy works better with more than 1000 DCs (as Microsoft describes).
REPADMIN /BIND direct_repl_partners: the operation succeeds.
I did some network sniffing, but I don't understand the RPC communication. When I try to replicate, there is bidirectional network traffic, from the bad DC to the HUB DC and back.
But a few seconds later the RPC call is cancelled. In the capture, the RPC BIND request is sometimes missing?!
This problem occurs only with the domain partition. For example, the dc=configuration,dc=domain,dc=net naming context replicates perfectly!
I investigated further: I created a "foo" computer account on the HUB DC. I cannot replicate this change to the bad DC ("RPC was cancelled...."). BUT if I use REPADMIN /REPLSINGLEOBJ our-bad-dc source_dsa_GUID "foo_wks_account_dn", IT REPLICATES?!?! Confusing, eh?
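With over a thousand DCs, reading DCDIAG logs by eye gets tedious, so here is a minimal Python sketch (not from the original thread) that pulls the failed tests and the event IDs out of a log like the one above. The function name summarize_dcdiag and the sample text are hypothetical, and the regular expressions assume the "passed test" / "failed test" and "EventID: 0x..." lines have the shape shown in that log.

```python
import re

# Matches result lines such as "OUR-BAD-DC failed test Services".
RESULT_RE = re.compile(r"(\S+) (passed|failed) test (\S+)")
# Matches event IDs as printed in the log, e.g. "EventID: 0x00000457".
EVENT_RE = re.compile(r"EventID: (0x[0-9A-Fa-f]+)")


def summarize_dcdiag(text):
    """Return (failed, event_ids).

    failed    -- list of (target, test) pairs for every "failed test" line
    event_ids -- sorted list of distinct event IDs, converted to decimal
    """
    failed = [(target, test)
              for target, status, test in RESULT_RE.findall(text)
              if status == "failed"]
    event_ids = sorted({int(h, 16) for h in EVENT_RE.findall(text)})
    return failed, event_ids


# Hypothetical excerpt, abbreviated from the log posted above.
sample = """\
Starting test: Services
   NtFrs Service is stopped on [OUR-BAD-DC]
   ......................... OUR-BAD-DC failed test Services
Starting test: systemlog
   An Error Event occured. EventID: 0x00000457
   ......................... OUR-BAD-DC failed test systemlog
Starting test: FsmoCheck
   ......................... OUR-DOMAIN.NET passed test FsmoCheck
"""

failed, event_ids = summarize_dcdiag(sample)
print(failed)     # (target, test) pairs that failed
print(event_ids)  # event IDs in decimal
```

The hexadecimal EventID 0x00000457 comes out as decimal 1111, which is the form Event Viewer displays and is easier to search for.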
cluberti Posted March 10, 2007
Perhaps network traces and replmon data would assist - that error code signifies a bus reset, usually due to TCP timeouts. You'll likely need network traces and replmon data to catch this.
rion Posted March 14, 2007
If you have 500 000 users and 1000 DCs, why not contact MS directly? With a company at that scale you should have a Premier contract... Just my two cents.